ABSTRACT—Most outlier detection
algorithms in data mining are designed to find outliers in static databases. Those algorithms
are generally inappropriate for detecting outliers in dynamic databases, where
data arrives continuously in the form of streams, such as sensor data.
An association rule based outlier detection method can be applied to streamed data,
where frequent item sets are evaluated internally. One of the outlier detection
approaches used for static databases is the clustering based method, where
the K-means clustering algorithm is used internally to discover outliers in
various static databases. In this paper, we propose two approaches for outlier
detection: one uses an association rule based method for dynamic databases,
and the other uses a pruning based local outlier detection method, which
internally uses K-means clustering, for static databases. Experiments on
various data sets show that the proposed methods detect deviant data effectively with fewer
computations.

Keywords—outlier detection,
static databases, dynamic databases, association rules, K-means clustering.


1. INTRODUCTION

An outlier is
a data point or object that differs from the other data points
or objects within a data set. Outlier detection is one of the important data mining techniques
commonly used to detect deviant data present in a data set.
Applications of outlier detection include credit card fraud detection,
network intrusion detection, and stock market analysis, where efficient
algorithms are needed to solve these problems.

The characteristic
feature of streamed data is that data arrives continuously and in a time-based
order. Streams therefore appear in various applications such as sensor data feeds
and network traffic measurement. Most outlier detection algorithms
are applied to static databases, because they scan the whole data set
iteratively to find top outliers or global outliers. Detecting
outliers in streamed data is difficult for two reasons:

1. The data sets in streamed data
are unbounded, which prevents scanning the entire data stream.

2. Streamed data require quick
responses, and processing has to be completed in a limited
time, so multiple scans of the data stream are not possible.

In static
databases, distance-based techniques are used to model the data points and
determine outliers. Commonly used distance functions are the Euclidean,
Manhattan, and Minkowski distances. In clustering based
outlier detection, the K-means algorithm is the most popular one: the
data set is divided into clusters, objects that lie
near the centroid of a cluster are assigned to that cluster, and
objects that lie far away from every centroid are
the outliers. While evaluating the degree of outlierness of an object, the
objects lying deep within a cluster are pruned from each cluster
during outlier detection. This reduces the computational time.
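The pruning idea can be illustrated with a minimal Python sketch. The radius definition used here, the mean point-to-centroid distance, is one plausible choice and an assumption of this sketch; the function names are ours:

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def prune_inliers(cluster):
    """Prune points closer to the centroid than the cluster radius
    (here taken as the mean point-to-centroid distance)."""
    c = centroid(cluster)
    radius = sum(dist(p, c) for p in cluster) / len(cluster)
    return [p for p in cluster if dist(p, c) >= radius]

# A tight group plus one distant point: only distant points survive pruning.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(prune_inliers(cluster))  # [(5.0, 5.0)]
```

Only the surviving points need the expensive outlierness evaluation, which is where the computational saving comes from.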

In this paper, we
propose two approaches for outlier detection, one for streaming
data and the other for static databases. In the first approach, an
algorithm employing association rules to check streamed data for
outliers is proposed. This algorithm defines a bounded set of transactions over
a data stream and derives association rules from those transactions.
Transactions containing outliers are then evaluated based on those rules.
For instance, if {P, Q} → {R} is an association rule with a
high confidence value of 85%, then a transaction containing P and Q but not R
is an outlier, because the item R does not appear in the transaction. A transaction is said
to contain an outlier if it does not contain an item which it is supposed to contain
according to a high-confidence association rule.
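The rule check in this example can be written as a small sketch; the function name `violates` is our own:

```python
def violates(transaction, antecedent, consequent):
    """A transaction violates a high-confidence rule X -> Y if it
    contains all of X but is missing some item of Y."""
    t = set(transaction)
    return set(antecedent) <= t and not set(consequent) <= t

# Rule {P, Q} -> {R}: a transaction holding P and Q but not R is flagged.
print(violates({"P", "Q", "S"}, {"P", "Q"}, {"R"}))  # True
print(violates({"P", "Q", "R"}, {"P", "Q"}, {"R"}))  # False
```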

In the second
approach, a static database is taken; a distance
function is applied to the data points and clusters are formed. The objects that
lie near the centroid of a cluster are pruned before evaluating outliers. Then a
distance-based measure, the Local Distance-based Outlier Factor (LDOF), is
applied to the remaining objects to decide whether each object is an
outlier or not. The top n objects ranked by this measure are reported as
outliers.

The remaining
sections of this paper are organized as follows: Section 2 provides related
works on outlier detection. Section 3 describes an association rule based
outlier detection method. Section 4 describes a pruning based local
distance-based outlier factor outlier detection method. Section 5 provides
experimental results. Section 6 concludes the paper.

2. RELATED WORK

A. Data Models in Streamed Data

Compared with traditional static databases, streamed data
contain unbounded data sets, and
it is difficult to choose a finite number of transactions [1] for mining. In
order to resolve this problem, two data models were proposed:

1. The accumulative data model,
which takes the transactions from the start up to the current time into
consideration. Whenever a new transaction arrives, it is accumulated into the
database.

2. The sliding window model, which
forms a bounded data set of transactions. When mining is
performed, only the transactions that are currently in the sliding window are taken
into consideration. The mining results are updated as the window
slides.
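The sliding window model can be illustrated with a minimal sketch using Python's `collections.deque`; the window size and transaction format here are illustrative only:

```python
from collections import deque

class SlidingWindow:
    """Keep only the most recent `size` transactions; mining at any
    moment looks only at the current window contents."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def add(self, transaction):
        # The oldest transaction falls out automatically once full.
        self.window.append(transaction)

    def current(self):
        return list(self.window)

w = SlidingWindow(size=3)
for tid in range(1, 6):          # five arriving transactions, IDs 1..5
    w.add({"tid": tid})
print([t["tid"] for t in w.current()])  # [3, 4, 5]
```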

B.
Association Rules and Outliers in Streamed Data

The estDec method
[3] internally uses a sliding window model and finds
frequent item sets within the current window. Initially, the estDec method
examines the transactions that are currently in the sliding window
and builds a prefix-tree P from the frequent item sets.
Once the window slides to the next bucket,
the estDec method finds new frequent item sets and inserts
item sets into P or deletes them from it,
and the process continues iteratively. To obtain association rules, the
prefix-tree is traversed to obtain the subsets of each frequent item set. A
traversal stack is maintained to derive association rules: it stores a
frequent item set of the prefix-tree [1], and all possible association
rules are derived from that item set.
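Rule derivation from frequent-item-set support counts, which the traversal stack performs conceptually, can be sketched as follows. The `derive_rules` helper and the toy support counts are our illustration, not the estDec implementation:

```python
from itertools import combinations

def derive_rules(support, min_conf):
    """From a table of frequent-itemset support counts, emit rules X -> Y
    whose confidence support(X union Y) / support(X) meets min_conf."""
    rules = []
    for itemset, count in support.items():
        items = frozenset(itemset)
        if len(items) < 2:
            continue
        for r in range(1, len(items)):
            for antecedent in combinations(sorted(items), r):
                a = frozenset(antecedent)
                if a in support and support[a] > 0:
                    conf = count / support[a]
                    if conf >= min_conf:
                        rules.append((set(a), set(items - a), conf))
    return rules

# Toy counts (out of 10 transactions), loosely modelled on the running
# example: {Joke} -> {Book} holds with confidence 7/7 = 1.0, while
# {Book} -> {Joke} only reaches 7/9 and is dropped at min_conf = 0.8.
support = {
    frozenset({"Joke"}): 7,
    frozenset({"Book"}): 9,
    frozenset({"Joke", "Book"}): 7,
}
for x, y, conf in derive_rules(support, min_conf=0.8):
    print(x, "->", y, round(conf, 2))
```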

A transaction
is said to contain an outlier if some items that were supposed to be present in the
transaction are actually absent. To evaluate whether a
transaction contains an outlier, an outlier degree is defined.

C. Distance-Based Outlier Detection Techniques

Knorr and Ng [4]
introduced a definition for distance-based outlier detection techniques [2]: an object x in a data set D is a DB(y, dist)-outlier
if at least a fraction y of the objects in D lie at a distance greater
than dist from x [2].
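This definition translates directly into code. The following is a naive sketch with an O(N) check per object; the function name and parameters are ours:

```python
import math

def is_db_outlier(x, data, frac, dist_threshold):
    """x is a DB(frac, dist)-outlier if at least a fraction `frac` of the
    other objects in the data set lie farther than `dist_threshold` from x."""
    others = [p for p in data if p != x]
    far = sum(1 for p in others if math.dist(x, p) > dist_threshold)
    return far / len(others) >= frac

data = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(is_db_outlier((10, 10), data, frac=0.9, dist_threshold=5))  # True
print(is_db_outlier((0, 0), data, frac=0.9, dist_threshold=5))    # False
```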

Angiulli and
Pizzuti [5-7] proposed an outlier detection method based on the concept of an
object's neighbourhood, where outliers are determined by ranking each object
by the sum of the distances from its k nearest neighbours.

Breunig, Kriegel
and Ng [8] assigned to each object a new measure named the Local Outlier Factor (LOF),
which gives its degree of outlierness. To evaluate the LOF of an
object, a restricted neighbourhood is considered and the density
of the object is compared with the densities of its neighbours.

Zhang, Hutter and
Jin [9] proposed a local distance-based outlier detection method for finding
outlier objects. For every object, the local distance-based outlier factor [2]
(LDOF) evaluates the degree to which it deviates from its nearest objects.
Calculating LDOF for all objects is expensive, i.e., O(N²), where N
is the number of objects in the data set.

Rough set theory
[3] was first proposed by Z. Pawlak and is used to deal with uncertain,
vague and incomplete data. Its main idea is the indiscernibility
relation, which describes the indistinguishability of objects. If an object
cannot be represented as a crisp set, it can be
represented by a rough set, giving an approximate representation of knowledge. A
rough set gives the lower and upper approximations of a set whose
knowledge is incompletely specified.
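The lower and upper approximations can be illustrated with a small sketch; the partition and target set below are a toy example of ours:

```python
def rough_approx(partition, target):
    """Lower approximation: equivalence classes fully inside the target.
    Upper approximation: classes that intersect the target at all."""
    lower, upper = set(), set()
    for block in partition:
        if block <= target:
            lower |= block
        if block & target:
            upper |= block
    return lower, upper

# Objects 1..6 partitioned by an indiscernibility relation; the target
# set {1, 2, 3} cannot be expressed exactly as a union of blocks.
partition = [{1, 2}, {3, 4}, {5, 6}]
lower, upper = rough_approx(partition, {1, 2, 3})
print(lower)  # {1, 2}
print(upper)  # {1, 2, 3, 4}
```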

3. ASSOCIATION RULE BASED OUTLIER DETECTION METHOD

A. Overview

Based on the work in [10],
a sliding window is used to obtain a bounded data set from the stream. From the
transactions within the window, a prefix-tree is built, and then
association rules are derived from the tree by traversing it iteratively in
multiple scans. If a transaction's items do not conform to the rules in the association
rule list, that transaction may be an outlier.

The following
figure shows a sample traversal stack containing items
and their counts, with the arrow indicating the top of the stack:

    Item      Count
 →  in        C(i1, i2, …, in)
    in-1      C(i1, i2, …, in-1)
    in-2      C(i1, i2, …, in-2)
    …         …
    i2        C(i1, i2)
    i1        C(i1)

Figure 1. A sample
traversal stack

The following
definitions concerning the outlier degree were given by Narita and Kitagawa in
[10]:

Definition 1. Let t be a transaction and R
be a set of association rules. Then t's
associative closure t+ is the smallest superset of t that is closed under the
rules in R:

    t+ = t ∪ { Y | (X → Y) ∈ R and X ⊆ t+ }                        (1)

Definition 2. Let t
be a transaction, R a set of
association rules, and t+ the associative closure of t. Then the
outlier degree of t is given as
below:

    OutlierDegree(t) = |t+ \ t| / |t+|                        (2)

The outlier degree lies between
0 and 1.
0 and 1.

Definition 3. A transaction is said to be an outlier transaction if its
outlier degree is greater than or equal to the minimum outlier degree, a
user-specified threshold.
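Definitions 1 and 2 can be sketched as a fixpoint computation of the associative closure followed by the degree ratio. The function names are ours, and the degree formula |t+ \ t| / |t+| is one common formulation of Definition 2:

```python
def closure(transaction, rules):
    """Associative closure t+: repeatedly add rule consequents whose
    antecedents are already contained in the growing set."""
    t_plus = set(transaction)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= t_plus and not consequent <= t_plus:
                t_plus |= consequent
                changed = True
    return t_plus

def outlier_degree(transaction, rules):
    """Share of the closure that the transaction itself is missing."""
    t_plus = closure(transaction, rules)
    return len(t_plus - set(transaction)) / len(t_plus)

# With the single rule {Milk} -> {Book}, a transaction {Bacon, Egg, Milk}
# closes to {Bacon, Egg, Milk, Book}, so its degree is 1/4.
rules = [({"Milk"}, {"Book"})]
print(outlier_degree({"Bacon", "Egg", "Milk"}, rules))  # 0.25
```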

The following steps
detect outliers in transaction data sets:

1. Build a prefix tree
that keeps track of the frequent item sets within the sliding window.

2. Derive a set of
association rules with high confidence values from the prefix tree.

3. Evaluate outlier
transactions from the transaction set within the sliding window.

4. Divide the outlier
transactions into two parts, one containing unobserved frequent item sets and
the other containing infrequent items.

B. Example

Let us consider a
transaction data set [1] consisting of 10 transactions in a sliding window,
drawn from the items Book, Joke, Milk, Bacon, Chocos, and Egg. The
following table gives the 10 transactions and the items in each transaction:
Table 1. A
Sample Transaction Data Set

    TID   Items
    1     Book, Joke, Milk
    2     Book, Chocos, Joke, Milk
    3     Book, Joke, Milk
    4     Bacon, Book, Chocos, Egg, Milk
    5     Bacon, Book, Chocos, Egg, Joke, Milk
    6     Book, Chocos, Joke, Milk
    7     Bacon, Book, Egg, Milk
    8     Bacon, Book, Egg, Joke, Milk
    9     Book, Joke, Milk
    10    Bacon, Egg, Milk

To derive
association rules from Table 1, the minimum confidence and minimum
support are set to 80% and 50%, respectively. The association rules derived
with these settings are given in Table 2; each association
rule has a high confidence value.

Table 2.
Association Rules Derived from Table 1

    RID   Rule
    1     {Joke} → {Book}
    2     {Joke, Milk} → {Book}
    3     {Joke} → {Book, Milk}
    4     {Bacon} → {Egg}
    5     {Bacon, Milk} → {Egg}
    6     {Bacon} → {Egg, Milk}
    7     {Milk} → {Book}

From (1), the associative
closure t+ of transaction 2 can be computed, and from (2) the outlier degree of transaction 2 is
0.33.

According to
Definition 3, if the minimum outlier degree is set to 0.3, then transaction
2 is an outlier. Similarly, transaction 10 is also an outlier, since
it does not follow the high-confidence association rules given in Table
2.
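A minimal self-contained check of the example data is given below. This sketch only flags direct rule violations (antecedent present, consequent missing), under which transaction 10 is caught by rule 7, {Milk} → {Book}; the full outlier-degree computation over associative closures can flag further transactions:

```python
transactions = {
    1: {"Book", "Joke", "Milk"},
    2: {"Book", "Chocos", "Joke", "Milk"},
    3: {"Book", "Joke", "Milk"},
    4: {"Bacon", "Book", "Chocos", "Egg", "Milk"},
    5: {"Bacon", "Book", "Chocos", "Egg", "Joke", "Milk"},
    6: {"Book", "Chocos", "Joke", "Milk"},
    7: {"Bacon", "Book", "Egg", "Milk"},
    8: {"Bacon", "Book", "Egg", "Joke", "Milk"},
    9: {"Book", "Joke", "Milk"},
    10: {"Bacon", "Egg", "Milk"},
}
rules = [  # Table 2, as (antecedent, consequent) pairs
    ({"Joke"}, {"Book"}),
    ({"Joke", "Milk"}, {"Book"}),
    ({"Joke"}, {"Book", "Milk"}),
    ({"Bacon"}, {"Egg"}),
    ({"Bacon", "Milk"}, {"Egg"}),
    ({"Bacon"}, {"Egg", "Milk"}),
    ({"Milk"}, {"Book"}),
]

# A transaction is flagged if some rule's antecedent is contained in it
# but the consequent is not.
flagged = [tid for tid, t in transactions.items()
           if any(a <= t and not c <= t for a, c in rules)]
print(flagged)  # [10]: transaction 10 contains Milk but not Book
```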

4. PRUNING BASED LDOF OUTLIER DETECTION METHOD

A. Overview

Based on the work in [9],
we use the Local Distance-based Outlier Factor (LDOF), which
measures by how much an object deviates from its neighbours. If the LDOF value
is high, the object deviates strongly from its neighbours
and can be considered an outlier. For an object p, the LDOF value is given by:

    LDOF(p) = d_knn(p) / d_inner(p)                        (3)

where d_knn(p) is the k-nearest-neighbour distance of p
and d_inner(p) is the k-nearest-neighbour inner distance of p.

For an object p, the distance d_knn(p) is the average distance from
p to all objects in its nearest-neighbour set N_p:

    d_knn(p) = (1 / k) Σ_{q ∈ N_p} dist(p, q)                        (4)

For an object p, the distance d_inner(p) is the average distance
among all pairs of distinct objects in its nearest-neighbour set:

    d_inner(p) = (1 / (k(k − 1))) Σ_{q, q′ ∈ N_p, q ≠ q′} dist(q, q′)                        (5)
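The LDOF computation can be sketched directly from these definitions. This is a brute-force illustration (names are ours; `math.dist` requires Python 3.8+):

```python
import math
from itertools import combinations

def ldof(p, data, k):
    """LDOF(p): mean distance from p to its k nearest neighbours,
    divided by the mean pairwise distance among those neighbours."""
    neighbours = sorted((q for q in data if q != p),
                        key=lambda q: math.dist(p, q))[:k]
    d_knn = sum(math.dist(p, q) for q in neighbours) / k
    inner = [math.dist(a, b) for a, b in combinations(neighbours, 2)]
    d_inner = sum(inner) / len(inner)
    return d_knn / d_inner

data = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(round(ldof((10, 10), data, k=3), 2))  # far point: LDOF well above 1
print(round(ldof((0, 0), data, k=3), 2))    # interior point: LDOF near 1
```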

B. Procedure

The major drawback
of the LDOF algorithm is its computational
cost, because LDOF must be calculated for each object in the data set.
To reduce this cost, we apply the K-means algorithm to form
clusters. After the clusters are formed, the radius of each cluster is calculated, and
the objects whose distance from the centroid is less than the radius of their
cluster are pruned. LDOF is then calculated only for the remaining (unpruned) objects,
which reduces the total number of computations. Among these objects, the top-n objects with the highest LDOF values are
reported as outliers.

The following
steps are involved in the pruning based LDOF outlier detection
algorithm:

1. Formation of clusters: the K-means algorithm is applied to
the entire data set to generate clusters, and the
radius of each cluster is calculated.

2. Clusters with minimal objects: if a
cluster contains fewer objects than the number of outliers
required, pruning is skipped for that cluster.

3. Pruning objects within a cluster:
the distance of each object in a cluster from its centroid is calculated. If the
distance is less than the cluster radius, the object is pruned.

4. Evaluation of outliers: LDOF is calculated for the unpruned
objects in each cluster, and the top-n objects with the highest LDOF values are declared outliers.

The following
algorithm gives the steps involved in the pruning based LDOF outlier detection
algorithm:

begin OutlierDetection(DS, c, it, n)

  set X ← Kmeans(c, it, DS)

  for each cluster Cj in X do

    Radiusj ← radius(Cj)

  end for

  if |Cj| > n then

    for each object pi in Cj do

      if distance(pi, oj)
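A runnable Python rendering of the whole procedure, following steps 1 to 4 above, might look like the sketch below. The k-means initialisation (first c points as seeds) and the radius definition (mean point-to-centroid distance) are our assumptions, since the text leaves them unspecified:

```python
import math
from itertools import combinations

def kmeans(points, c, iterations):
    """Plain k-means; the first c points seed the centroids."""
    centroids = list(points[:c])
    for _ in range(iterations):
        clusters = [[] for _ in range(c)]
        for p in points:
            j = min(range(c), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters, centroids

def ldof(p, data, k):
    """Mean k-NN distance of p over mean pairwise distance of its k-NN set."""
    neighbours = sorted((q for q in data if q != p),
                        key=lambda q: math.dist(p, q))[:k]
    d_knn = sum(math.dist(p, q) for q in neighbours) / k
    inner = [math.dist(a, b) for a, b in combinations(neighbours, 2)]
    return d_knn / (sum(inner) / len(inner))

def detect_outliers(data, c, iterations, n, k):
    clusters, centroids = kmeans(data, c, iterations)
    candidates = []
    for cl, cen in zip(clusters, centroids):
        if len(cl) < n:              # step 2: too few objects, skip pruning
            candidates.extend(cl)
            continue
        radius = sum(math.dist(p, cen) for p in cl) / len(cl)
        # Step 3: keep only objects at or beyond the cluster radius.
        candidates.extend(p for p in cl if math.dist(p, cen) >= radius)
    # Step 4: rank surviving candidates by LDOF, report the top n.
    ranked = sorted(candidates, key=lambda p: ldof(p, data, k), reverse=True)
    return ranked[:n]

data = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]
print(detect_outliers(data, c=2, iterations=5, n=1, k=3))  # [(20, 20)]
```

Only the points that survive the radius pruning reach the O(N) LDOF evaluation, which is where the claimed reduction in computations comes from.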