A practical method of finding anomalies within a set of activities using the Bag-of-words model.
I am going to show one unsupervised learning technique in detail. The major use case is behavioral-based anomaly detection, so let's start with that. Imagine you are collecting daily activity from people. In this example there are 6 people (\( S_1 - S_6 \)). When all the data are sorted and pre-processed, the result might look like this list.
\( S_1 \) is the set of daily activities of the first person, \( S_2 \) of the second one, and so on. If you look at this list, you can pretty easily recognize that the activity of \( S_6 \) is somehow different from the others. That's because there are only 6 people. What if there were 6 thousand? Or 6 million? Then there would be no way you could recognize the anomalies yourself. And that's what machines can do. Once a machine can solve such a problem at a small scale, it can usually handle the large scale relatively easily. Therefore the goal here is to build an unsupervised learning model which will identify \( S_6 \) as an anomaly.
What is this good for? Let me give you 2 examples.
The first example is traditional audit log analysis for the purpose of suspicious activity detection. Let's look at e-mail. Almost everyone has their own day-to-day usage pattern. If this pattern suddenly changes, it is considered "suspicious". It might mean that someone has stolen your credentials. It can also simply mean that you changed your habits. Machines can't know the underlying reason. What machines can do is analyze millions of accounts and pick out only the suspicious ones, which is typically a very small number. Then an operator can manually call these people and find out what is going on.
Or imagine you are doing pre-sales research. You employ an agency to run a country-wide survey, and it contains a question like 'Please give us 40-50 words of feedback'. Let's say you have got 30,000 responses which satisfy the length requirement. Now you want to pick out the responses which are somehow special, because they might be extremely good, extremely bad, or just interesting. All of these give you valuable insight and possibly direction for the future. Since the overall volume is relatively high, any human would certainly fail at this job. For machines, this is just a piece of cake.
Now let's look at how to teach the machine to do the job.
Example source code is available here. It is a Java Maven project. Unpack it into any folder, compile it with 'mvn package', and run it by executing 'java -jar target/anomalybagofwords-1.0.jar 0.5 sample-data-small.txt'. If you run the program this way, it will execute the described process over the prepared data set and identify \( S_6 \) as an anomaly. If you want to drill into the code, start with the 'BagofwordsAnomalyDetectorApp' class.
Let's briefly establish useful terminology.
Bag of words is a set of unique words within a text, where each word is paired with the number of its occurrences. One specific point is that the order of words is ignored by this structure. If a word is not present in the text, its occurrence count is considered to be \( 0 \). For example, the bag of words for 'eat, read book, ride bicycle, eat, play computer games, write homework, read book, eat, brush teeth, sleep' can be written as the following table.
Word | Number of occurrences |
---|---|
eat | 3 |
read | 2 |
book | 2 |
ride | 1 |
bicycle | 1 |
play | 1 |
computer | 1 |
games | 1 |
write | 1 |
homework | 1 |
brush | 1 |
teeth | 1 |
sleep | 1 |
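Building such a structure is a one-liner over a word list. The following is a minimal sketch; the class and method names are illustrative, not taken from the attached project.

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {

    // Builds a bag of words: each unique word mapped to its occurrence count.
    // The order of words is ignored, matching the definition above.
    public static Map<String, Integer> build(String[] words) {
        Map<String, Integer> bag = new HashMap<>();
        for (String word : words) {
            bag.merge(word, 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        String[] activity = ("eat read book ride bicycle eat play computer games "
                + "write homework read book eat brush teeth sleep").split(" ");
        Map<String, Integer> bag = build(activity);
        System.out.println(bag.get("eat"));  // 3
        System.out.println(bag.get("read")); // 2
    }
}
```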
The next term is the distance between 2 bags of words. The distance is written as \( |B(x) - B(y)| \) and is calculated as the sum of the absolute differences in counts over all words appearing in either bag. Here is an example. $$ |B(read\ article\ and\ book) - B(write\ book\ and\ book)| = \\ = \left| \left( \begin{array}{cc} read & 1 \\ write & 0 \\ article & 1 \\ and & 1 \\ book & 1 \end{array} \right) - \left( \begin{array}{cc} read & 0 \\ write & 1 \\ article & 0 \\ and & 1 \\ book & 2 \end{array} \right) \right| = \textbf{4} $$ Applying this definition, you can calculate the distance between all the example sequences. For example \( |B(S_1) - B(S_2)| = \textbf{12} \) and \( |B(S_1) - B(S_6)| = \textbf{30} \). The latter is higher because \( S_1 \) and \( S_6 \) differ more in their words than \( S_1 \) and \( S_2 \). This is analogous to the distance between 2 points in space.
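The distance definition translates directly to code: take the union of the words in both bags, treat a missing word as count 0, and sum the absolute differences. This sketch reproduces the worked example above; the class name is illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BagDistance {

    // Builds a bag of words from whitespace-separated text.
    public static Map<String, Integer> bag(String text) {
        Map<String, Integer> b = new HashMap<>();
        for (String w : text.split("\\s+")) b.merge(w, 1, Integer::sum);
        return b;
    }

    // Distance |B(x) - B(y)|: sum of absolute count differences over the
    // union of words from both bags (a missing word counts as 0).
    public static int distance(Map<String, Integer> x, Map<String, Integer> y) {
        Set<String> words = new HashSet<>(x.keySet());
        words.addAll(y.keySet());
        int sum = 0;
        for (String w : words) {
            sum += Math.abs(x.getOrDefault(w, 0) - y.getOrDefault(w, 0));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Reproduces the worked example from the text.
        System.out.println(distance(bag("read article and book"),
                                    bag("write book and book"))); // 4
    }
}
```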
The last term is the probability density function. A probability density function is a continuous function, defined over the whole real number line, which is greater than or equal to zero for every input and whose integral over the whole space is 1. The notation \( P(x) \) will be used. More formally, this means the following. $$ P(x) \ge 0 \quad \forall x \in \mathbb{R} \\ \int_{\mathbb{R}}P(x)\,dx = 1 $$ A typical example of a probability density function is the normal distribution. The example source code uses a more complex one, called a normal distribution mixture. The parameter \( x \) is called a random variable. In a very simplistic way, the higher \( P(x) \) is, the more "likely" the variable \( x \) is. If \( P(x) \) is low, then the variable \( x \) falls away from the standard. This will be used when setting up the threshold value. Finally, let's make a note about how to create a probability density from a finite number of random variables. If \( [x_1,\dots,x_N] \) is a set of N random variables (or samples you can collect), then there is a process called estimation which transforms this finite set of numbers into a continuous probability density function \( P \). Explanation of this process is out of scope for this article; just remember that such a thing exists. In particular, the attached example uses a variation of the EM algorithm.
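To make the "low \( P(x) \) means unusual" intuition concrete, here is a minimal sketch using a single normal distribution instead of the mixture the attached project fits with EM. The estimation step is the crudest possible one (sample mean and standard deviation); everything here is illustrative, not the project's actual implementation.

```java
public class GaussianDensity {

    // Density of a single normal distribution. The attached project uses a
    // mixture of normals fitted by EM, but the principle is the same:
    // a low P(x) means x falls away from the standard.
    public static double pdf(double x, double mean, double stdDev) {
        double z = (x - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    // A crude "estimation" step: derive mean and standard deviation
    // from a finite set of samples [x_1, ..., x_N].
    public static double[] estimate(double[] samples) {
        double mean = 0;
        for (double s : samples) mean += s;
        mean /= samples.length;
        double var = 0;
        for (double s : samples) var += (s - mean) * (s - mean);
        var /= samples.length;
        return new double[]{mean, Math.sqrt(var)};
    }

    public static void main(String[] args) {
        double[] samples = {12, 14, 11, 13, 12, 30}; // the last one is unusual
        double[] fit = estimate(samples);
        // The outlying value gets a much lower density than a typical one.
        System.out.println(pdf(12, fit[0], fit[1]) > pdf(30, fit[0], fit[1])); // true
    }
}
```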
Now it's time to explain the process. The whole process can be separated into 2 phases called training and prediction. Training is the phase where all the data are iterated through and a relatively small model is produced. This is usually the most time-consuming operation, and the outcome is sometimes called a predictive model. Once the model is prepared, the prediction phase comes into play. In this phase an unknown data record is examined by the model. Next, let's drill into the details.
The training phase requires 2 inputs.
The whole process is pretty straightforward, and you can find the implementation in the source code, in the class BagofwordsAnomalyDetector, method performTraining.
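As a rough illustration of what such a training pass computes (this is a sketch of the idea, not the project's performTraining): build a bag of words for every training activity, then characterize each activity by its average distance to all the others. These numbers are the random variables from which the probability density would then be estimated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TrainingSketch {

    public static Map<String, Integer> bag(String text) {
        Map<String, Integer> b = new HashMap<>();
        for (String w : text.split("\\s+")) b.merge(w, 1, Integer::sum);
        return b;
    }

    public static int distance(Map<String, Integer> x, Map<String, Integer> y) {
        Set<String> words = new HashSet<>(x.keySet());
        words.addAll(y.keySet());
        int sum = 0;
        for (String w : words) sum += Math.abs(x.getOrDefault(w, 0) - y.getOrDefault(w, 0));
        return sum;
    }

    // For each training activity, the average distance to all the others.
    // These values are the samples from which a probability density is
    // estimated (the project uses an EM-fitted normal distribution mixture;
    // that step is omitted here).
    public static double[] averageDistances(List<String> activities) {
        List<Map<String, Integer>> bags = new ArrayList<>();
        for (String a : activities) bags.add(bag(a));
        double[] result = new double[bags.size()];
        for (int i = 0; i < bags.size(); i++) {
            double sum = 0;
            for (int j = 0; j < bags.size(); j++) {
                if (i != j) sum += distance(bags.get(i), bags.get(j));
            }
            result[i] = sum / (bags.size() - 1);
        }
        return result;
    }
}
```

An activity that behaves like the rest ends up with a small average distance; an outlier like \( S_6 \) ends up with a much larger one.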
When the training phase finishes, the model is ready to be used in the prediction phase.
This is the phase where potentially unseen activities are tested by the model. The model evaluates them and returns whether they are considered an anomaly or not.
The whole process works for each activity \( S_U \) separately, where U stands for "unknown". It can be summarized by these points.
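The decision step can be sketched as follows. This is illustrative only: a single Gaussian stands in for the project's normal distribution mixture, and the names and parameters are assumptions, not the attached project's API. \( S_U \) is reduced to its average bag-of-words distance to the training activities, the density of that value is evaluated, and a density below the threshold flags an anomaly.

```java
public class PredictionSketch {

    // Classifies an unknown activity S_U as an anomaly when the density of
    // its average distance to the training activities falls below the
    // threshold. mean and stdDev come from the training phase.
    public static boolean isAnomaly(double avgDistance, double mean,
                                    double stdDev, double densityThreshold) {
        double z = (avgDistance - mean) / stdDev;
        double p = Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
        return p < densityThreshold;
    }

    public static void main(String[] args) {
        // Hypothetical trained values: typical distances around 13 +/- 2.
        System.out.println(isAnomaly(30, 13, 2, 0.01)); // true  (far from standard)
        System.out.println(isAnomaly(12, 13, 2, 0.01)); // false (typical)
    }
}
```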
You have learned about a relatively simple model for identifying unusual sequences among a bulk of them. Now you can play with the source code, try different variations, and see how they affect the result. Here are a few ideas to start with.