## Anomaly detection using the Bag-of-words model

I am going to show one unsupervised learning technique in detail. The major use case is behavioral-based anomaly detection, so let’s start with that. Imagine you are collecting daily activity from people. In this example there are 6 people, $$S_1 – S_6$$. When all the data is sorted and pre-processed, the result might look like this list.

• $$S_1 =$$ eat, read book, ride bicycle, eat, play computer games, write homework, read book, eat, brush teeth, sleep
• $$S_2 =$$ read book, eat, walk, eat, play tennis, go shopping, eat snack, write homework, eat, brush teeth, sleep
• $$S_3 =$$ wake up, walk, eat, sleep, read book, eat, write homework, wash bicycle, eat, listen music, brush teeth, sleep
• $$S_4 =$$ eat, ride bicycle, read book, eat, play piano, write homework, eat, exercise, sleep
• $$S_5 =$$ wake up, eat, walk, read book, eat, write homework, watch television, eat, dance, brush teeth, sleep
• $$S_6 =$$ eat, hang out, date girl, skating, use mother’s CC, steal clothes, talk, cheating on taxes, fighting, sleep

$$S_1$$ is the set of daily activities of the first person, $$S_2$$ of the second one, and so on. If you look at this list, you can pretty easily recognize that the activity of $$S_6$$ is somehow different from the others. That’s because there are only 6 people. What if there were 6 thousand? Or 6 million? Unfortunately, there is no way you could recognize the anomalies manually. And that’s what machines can do. Once a machine can solve such a problem on a small scale, it can usually handle the large scale relatively easily. Therefore the goal here is to build an unsupervised learning model which will identify $$S_6$$ as an anomaly.

What is this good for? Let me give you 2 examples.

The first example is traditional audit log analysis for the purpose of suspicious activity detection. Let’s look at e-mail. Almost everyone has their own day-to-day usage pattern. If this pattern suddenly changes, then this is considered “suspicious”. It might mean that someone has stolen your credentials. It can also mean that you just changed your habits. Machines can’t know the underlying reason. What machines can do is analyze millions of accounts and pick out only the suspicious ones, which is typically a very small number. Then an operator can manually call these people and discover what is going on.

Or imagine you are doing pre-sales research. You employ an agency to run a country-wide survey, and there is a question like ‘Please give us 40-50 words of feedback’. Let’s say you have got 30,000 responses which satisfy the length requirement. Now you want to choose the responses which are somehow special, because they might be extremely good, extremely bad, or just interesting. All of these give you valuable insight and possibly direction for the future. Since the overall amount is relatively high, any human would certainly fail at this job. For machines, this is just a piece of cake.

Now let’s look at how to teach the machine to do the job.

The example project (the button for download is right above this paragraph) is a standard Java Maven project. Unpack it into any folder, compile it with ‘mvn package‘, and run it by executing ‘java -jar target/anomalybagofwords-1.0.jar 0.5 sample-data-small.txt‘. If you run the program this way, it will execute the described process over the cooked data set and identify $$S_6$$ as an anomaly. If you want to drill into the code, then start with the ‘BagofwordsAnomalyDetectorApp‘ class.

## Terminology

Let’s briefly establish useful terminology.

A bag of words is a set of unique words within a text, where each word is paired with the number of its occurrences. One specific point is that the order of words is ignored by this structure. If a word is not present in the text, then its occurrence is considered to be $$0$$. For example, the bag of words for ‘eat, read book, ride bicycle, eat, play computer games, write homework, read book, eat, brush teeth, sleep‘ can be written as the following table.

| Word | Number of occurrences |
| --- | --- |
| eat | 3 |
| read | 2 |
| book | 2 |
| ride | 1 |
| bicycle | 1 |
| play | 1 |
| computer | 1 |
| games | 1 |
| write | 1 |
| homework | 1 |
| brush | 1 |
| teeth | 1 |
| sleep | 1 |

Sometimes you can find this visualized as a histogram. The notation $$B(x)$$ will be used for a bag of words. Here is the example for $$S_1$$.

$$B(S_1) = \left( \begin{array}{cc} eat & 3 \\ read & 2 \\ book & 2 \\ ride & 1 \\ bicycle & 1 \\ play & 1 \\ computer & 1 \\ games & 1 \\ write & 1 \\ homework & 1 \\ brush & 1 \\ teeth & 1 \\ sleep & 1 \end{array} \right)$$
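To make this concrete, here is a minimal sketch of how such a bag of words can be computed from a comma-separated activity string. This is a standalone illustration, not code from the example project; the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BagOfWordsDemo {

    // Builds a bag of words: each unique word mapped to its number of occurrences.
    // The order of words in the input is ignored, which is the defining
    // property of the structure.
    public static Map<String, Integer> bagOfWords(String activity) {
        Map<String, Integer> res = new LinkedHashMap<String, Integer>();
        // split on commas and whitespace so 'read book' contributes 'read' and 'book'
        for (String word : activity.split("[,\\s]+")) {
            if (word.isEmpty()) {
                continue;
            }
            Integer cnt = res.get(word);
            res.put(word, cnt == null ? 1 : cnt + 1);
        }
        return res;
    }

    public static void main(String[] args) {
        Map<String, Integer> b = bagOfWords(
                "eat, read book, ride bicycle, eat, play computer games, "
                + "write homework, read book, eat, brush teeth, sleep");
        System.out.println(b.get("eat"));  // 3
        System.out.println(b.get("read")); // 2
    }
}
```

Running the main method reproduces the counts from the table above for $$S_1$$.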

The next term is the distance between 2 bags of words. The distance will be written as $$|B(x) – B(y)|$$ and is calculated as the sum of the absolute values of the differences in occurrences for all words appearing in either bag. Here is an example.

$$|B(read\ article\ and\ book) – B(write\ book\ and\ book)| = \\ = \left| \left( \begin{array}{cc} read & 1 \\ write & 0 \\ article & 1 \\ and & 1 \\ book & 1 \end{array} \right) – \left( \begin{array}{cc} read & 0 \\ write & 1 \\ article & 0 \\ and & 1 \\ book & 2 \end{array} \right) \right| = \textbf{4}$$

Applying this definition, you can calculate the distance between all the example sequences. For example $$|B(S_1) – B(S_2)| = \textbf{12}$$ and $$|B(S_1) – B(S_6)| = \textbf{30}$$. The latter is higher because $$S_1$$ and $$S_6$$ differ more in their words than $$S_1$$ and $$S_2$$ do. This is an analogy to the distance between 2 points in space.
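The distance computation itself is only a few lines. The following sketch (a hypothetical helper, not the project's API) reproduces the worked example above:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BagDistanceDemo {

    // Distance |B(x) - B(y)|: sum of absolute differences of occurrences
    // over all words appearing in either bag (a missing word counts as 0).
    public static int distance(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> words = new HashSet<String>(a.keySet());
        words.addAll(b.keySet());
        int res = 0;
        for (String w : words) {
            int ca = a.containsKey(w) ? a.get(w) : 0;
            int cb = b.containsKey(w) ? b.get(w) : 0;
            res += Math.abs(ca - cb);
        }
        return res;
    }

    public static void main(String[] args) {
        // B(read article and book) vs. B(write book and book)
        Map<String, Integer> x = new HashMap<String, Integer>();
        x.put("read", 1); x.put("article", 1); x.put("and", 1); x.put("book", 1);
        Map<String, Integer> y = new HashMap<String, Integer>();
        y.put("write", 1); y.put("and", 1); y.put("book", 2);
        System.out.println(distance(x, y)); // 4, matching the worked example
    }
}
```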

The last term is the probability density function: a continuous function, defined over the whole real number space, which is greater than or equal to zero for every input and whose integral over the whole space is 1. The notation $$P(x)$$ will be used. More formally, this means the following.

$$P(x) \ge 0 \quad \forall x \in \mathbb{R} \\ \int_{\mathbb{R}}P(x)\,dx = 1$$

A typical example of a probability density function is the normal distribution. The example source code uses a more complex one, called a normal distribution mixture. The parameter $$x$$ is called a random variable. In a very simplistic way, the higher $$P(x)$$ is, the more “likely” the variable $$x$$ is. If $$P(x)$$ is low, then the variable $$x$$ falls away from the standard. This will be used when setting up the threshold value. Finally, let’s make a note about how to create a probability density from a finite number of random variables. If $$[x_1, \dots, x_N]$$ is a set of N random variables (or samples you can collect), then there is a process called estimation which transforms this finite set of numbers into a continuous probability density function $$P$$. An explanation of this process is out of scope for this article; just remember there is such a thing. In particular, the attached example uses a variation of the EM algorithm.
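The estimation step itself is out of scope, but to give a feeling for how a density separates typical values from outliers, here is a deliberately simplified sketch that fits a single normal distribution by its sample mean and variance. The attached project fits a normal distribution mixture with an EM variation instead, so treat this as an illustration only.

```java
public class DensityDemo {

    // Estimates a single normal density N(mu, sigma^2) from samples.
    // Returns {mu, variance}.
    public static double[] estimate(double[] samples) {
        double mu = 0;
        for (double x : samples) {
            mu += x;
        }
        mu /= samples.length;
        double var = 0;
        for (double x : samples) {
            var += (x - mu) * (x - mu);
        }
        var /= samples.length;
        return new double[]{mu, var};
    }

    // Evaluates the normal density P(x) for the estimated parameters.
    public static double density(double x, double mu, double var) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    public static void main(String[] args) {
        double[] p = estimate(new double[]{12, 13, 11, 12, 30});
        // the outlying sample 30 gets a much lower density than a typical value
        System.out.println(density(12, p[0], p[1]) > density(30, p[0], p[1])); // true
    }
}
```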

## Process

Now it’s time to explain the process. The whole process can be separated into 2 phases, called training and prediction. Training is the phase where all the data is iterated through and a relatively small model is produced. This is usually the most time consuming operation, and the outcome is sometimes called a predictive model. Once the model is prepared, the prediction phase comes into place. In this phase an unknown data record is examined by the model. Next, let’s drill down into the details.

### Training phase

Two inputs are required for the training phase.

• A set of activities $$[S_1, …, S_N]$$. This might be the example set from the beginning.
• A sensitivity factor $$\alpha$$, which is just a number, initially picked by a human, such that $$\alpha \geq 0$$. More on this one later.

The whole process is pretty straightforward and you can find the implementation in the source code, class BagofwordsAnomalyDetector, method performTraining.

1. For each activity, calculate a bag of words. The result of this step is $$N$$ bags of words $$[B(S_1), …, B(S_N)]$$.
2. Calculate the random variables. One random variable is calculated for each bag of words. The result of this step is $$N$$ random variables $$[x_1, …, x_N]$$. The formula for the calculation is the following.
$$x_i = \frac{\sum_{j=1}^N |B(S_i) – B(S_j)|}{N} \quad \forall i = 1..N$$
3. Estimate the probability density function $$P$$. This process takes the random variables $$[x_1, …, x_N]$$ and produces a probability density function $$P$$. A variation of the EM algorithm is used in the example program.
4. Calculate the threshold value $$\theta$$. The value is calculated according to the following formula.
$$\theta = \frac{\sum_{i=1}^N P(x_i)}{N} \cdot \alpha$$

Regarding the sensitivity factor $$\alpha$$: the higher $$\alpha$$ is, the more activities will be identified as anomalies. The problem with an unsupervised learning model is that the data is not labeled, so there is no way to know what the correct answers are and how to set up the optimal $$\alpha$$. Therefore some rules of thumb are used instead. For example, set up $$\alpha$$ so it reports a reasonable percentage of activity as anomalous. Typically it is required that the amount of identified anomalies must be manageable by the human investigators. In a bigger system there is usually a feedback loop which incrementally adjusts $$\alpha$$ until the optimal value is reached. This is then called reinforcement learning. For this small example, $$\alpha$$ was picked manually by trial and error just to reach the goal.

5. Store all bags of words, $$P$$ and $$\theta$$ for later usage.

When the training phase finishes, the model is ready to be used in the prediction phase.
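Putting steps 2 and 4 together, the numeric core of the training phase can be sketched like this. These are toy helper methods operating on a precomputed distance matrix, not the project's `performTraining` implementation.

```java
public class TrainingStepsDemo {

    // Step 2: one random variable per sequence, computed as the mean distance
    // from its bag of words to all bags (including itself, where distance is 0).
    public static double[] randomVariables(int[][] dist) {
        int n = dist.length;
        double[] res = new double[n];
        for (int i = 0; i < n; ++i) {
            double sum = 0;
            for (int j = 0; j < n; ++j) {
                sum += dist[i][j];
            }
            res[i] = sum / n;
        }
        return res;
    }

    // Step 4: threshold theta = mean density over the training variables, scaled by alpha.
    public static double threshold(double[] densities, double alpha) {
        double sum = 0;
        for (double p : densities) {
            sum += p;
        }
        return sum / densities.length * alpha;
    }

    public static void main(String[] args) {
        // toy pairwise distance matrix |B(S_i) - B(S_j)| for 3 sequences
        int[][] dist = {{0, 12, 30}, {12, 0, 28}, {30, 28, 0}};
        double[] xs = randomVariables(dist);
        System.out.println(xs[0]); // 14.0
        System.out.println(threshold(new double[]{1.0, 2.0, 3.0}, 0.5)); // 1.0
    }
}
```

An anomalous sequence, being far from everything else, gets a large random variable, which in turn lands in a low-density region of the estimated $$P$$.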

### Prediction phase

This is the phase when potentially unseen activities are tested by the model. The model evaluates them and returns whether each activity is considered an anomaly or not.

The whole process works for each activity $$S_U$$ separately ($$U$$ stands for “unknown”) and can be summarized by these points.

1. Calculate the bag of words $$B(S_U)$$.
2. Calculate the random variable $$x_U$$ as
$$x_U = \frac{\sum_{i=1}^N |B(S_i) – B(S_U)|}{N}$$
3. If $$P(x_U) \le \theta$$, then activity $$S_U$$ is considered an anomaly. Otherwise the activity is considered normal.
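The prediction steps can be sketched as follows, again with hypothetical helpers rather than the project's API:

```java
public class PredictionDemo {

    // Step 2: mean distance from the unknown bag to all training bags,
    // given a precomputed vector of distances |B(S_i) - B(S_U)|.
    public static double randomVariable(int[] distancesToTraining) {
        double sum = 0;
        for (int d : distancesToTraining) {
            sum += d;
        }
        return sum / distancesToTraining.length;
    }

    // Step 3: anomaly when the density of x_U is at or below the trained threshold.
    public static boolean isAnomaly(double pOfXu, double theta) {
        return pOfXu <= theta;
    }

    public static void main(String[] args) {
        double xU = randomVariable(new int[]{30, 28, 26});
        System.out.println(xU); // 28.0
        // a low density relative to theta flags the activity as an anomaly
        System.out.println(isAnomaly(0.001, 0.01)); // true
    }
}
```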

## Summary

You have learned about a relatively simple model for identifying an unusual sequence from a bulk of them. Now you can play with the source code, try different variations, and see how they affect the result. Here are a few ideas to start with.

• Normalize the bags of words. In other words, don’t count the absolute numbers, just the relative frequencies.
• Use chunks of more than one word. This is then called an n-gram model.
• Try to implement different ways to measure the distance between items, for example sequence alignment.
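As a starting point for the first idea, here is one possible way to normalize a bag of words into relative frequencies. This is a hypothetical variation, not part of the example project.

```java
import java.util.HashMap;
import java.util.Map;

public class NormalizedBagDemo {

    // Turns absolute counts into relative frequencies so that
    // long and short activity sequences become directly comparable.
    public static Map<String, Double> normalize(Map<String, Integer> bag) {
        int total = 0;
        for (int c : bag.values()) {
            total += c;
        }
        Map<String, Double> res = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : bag.entrySet()) {
            res.put(e.getKey(), e.getValue() / (double) total);
        }
        return res;
    }

    public static void main(String[] args) {
        Map<String, Integer> bag = new HashMap<String, Integer>();
        bag.put("eat", 3);
        bag.put("sleep", 1);
        System.out.println(normalize(bag).get("eat")); // 0.75
    }
}
```

With this change, the distance definition has to work on doubles instead of ints, but the rest of the process stays the same.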

### Key takeaways

• There is no knowledge about what the correct outcome is at the beginning of unsupervised learning. Therefore a best guess and possibly a feedback loop are implemented.
• Predictive models are usually built in the training phase and then used to classify the unknown data in the prediction phase.
• In order to be able to find the outliers, abstract features like sentences or actions need to be transformed into a measurable form. After that, probability and statistics are used to establish the baselines and find the outliers.

## Unit-Level Performance tuning in Java

When it comes to performance testing, I hear a lot about having a dedicated environment, funky tools like JMeter or Apica, and complicated scenarios. These take a lot of effort to set up and maintain. Therefore, I like to first make sure that the most critical units are well-optimized without any of these tools. One way to do this is through unit-level performance test apps. What’s great about these apps is that there is no need for any special tool, they can be ready to go within minutes, and they are proven to save a lot of time, money, and calls from angry customers.

In this article, I am going to share an example of such a test app. You can do the same in your projects.

Technology stack:

• Netbeans IDE
• Java
• Maven

## Testable Unit

In order to run performance tests for a single unit, you need well-defined and testable units. Let’s work with an example (I have cooked this one up, but you will get the idea).

This is a module for message broadcasting. The core method accepts a pipe-delimited String as input, extracts parameters, finds the appropriate username, and broadcasts the messages. Here is the implementation.

package com.enterprisemath.articles.unitperformance;

/**
 * Provider for user related data.
 */
public interface UserProvider {

    /**
     * Returns user name.
     *
     * @param userId user id
     * @return user name
     */
    public String getUserName(String userId);

}

// -----------------------------------------------------

package com.enterprisemath.articles.unitperformance;

import java.util.Date;

/**
 * Service for broadcasting messages.
 */
public interface BroadcastService {

    /**
     * Broadcasts message (the method signature was lost in the original listing and is reconstructed here).
     *
     * @param userName user name
     * @param timestamp timestamp
     * @param message message
     */
    public void broadcastMessage(String userName, Date timestamp, String message);

}

// -----------------------------------------------------

package com.enterprisemath.articles.unitperformance;

import com.enterprisemath.utils.Dates;
import com.enterprisemath.utils.ValidationUtils;
import java.util.Date;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * Module for message broadcasting.
 * Note: parts of this listing were truncated in the original and have been reconstructed.
 */
public class MessageBroadcastModule {

    /**
     * Provider for user data.
     */
    private UserProvider userProvider;

    /**
     * Service for broadcasting messages.
     */
    private BroadcastService broadcastService;

    /**
     * Creates new instance.
     */
    private MessageBroadcastModule() {
    }

    /**
     * Guards this object to be consistent. Throws exception if this is not the case.
     */
    private void guardInvariants() {
        ValidationUtils.guardNotNull(userProvider, "userProvider cannot be null");
        ValidationUtils.guardNotNull(broadcastService, "broadcastService cannot be null");
    }

    /**
     * Processes message.
     *
     * @param message message
     */
    public void processMessage(String message) {
        String[] parts = message.split("\\|");
        String userName = null;
        String txt = null;
        Date now = null;
        for (String part : parts) {
            String[] subs = part.split("=", 2);
            if (subs[0].equals("userId")) {
                userName = userProvider.getUserName(subs[1]);
            }
            else if (subs[0].equals("timestamp")) {
                now = Dates.parse(subs[1], "yyyy/MM/dd HH:mm:ss");
            }
            else if (subs[0].equals("message")) {
                txt = subs[1];
            }
        }

        ValidationUtils.guardNotNull(userName, "wrong input message, user is missing");
        ValidationUtils.guardNotNull(txt, "wrong input message, text is missing");
        ValidationUtils.guardNotNull(now, "wrong input message, timestamp is missing");

        broadcastService.broadcastMessage(userName, now, txt);
    }

    @Override
    public String toString() {
        return ToStringBuilder.reflectionToString(this);
    }

    /**
     * Creates new instance.
     *
     * @param userProvider provider for user data
     * @param broadcastService service for broadcasting messages
     * @return created object
     */
    public static MessageBroadcastModule create(UserProvider userProvider, BroadcastService broadcastService) {
        MessageBroadcastModule res = new MessageBroadcastModule();
        res.userProvider = userProvider;
        res.broadcastService = broadcastService;
        res.guardInvariants();
        return res;
    }

}


If you want to compile this, here is a POM file with dependencies:

<project
xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.enterprisemath.articles</groupId>
<artifactId>unitperformance</artifactId>
<packaging>jar</packaging>
<version>1.0-SNAPSHOT</version>
<name>unitperformance</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>com.enterprisemath</groupId>
<artifactId>em-utils</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<version>1.9.5</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>

The core method is called processMessage. In more detail, this method does the following:

1. Splits input into parts.
2. Maps relevant parameters and finds the username.
3. Validates that all mandatory parameters for this module are present.

If you want to properly test this method, then the test case has to cover the following:

1. Positive case, when message is broadcasted
2. Negative cases, when validation failed
3. Make sure that no harmful side effect is caused

And here is the example implementation.

package com.enterprisemath.articles.unitperformance;

import com.enterprisemath.utils.Dates;
import org.apache.commons.lang3.builder.ToStringBuilder;
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;
import org.junit.Before;
import org.junit.Test;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.when;

/**
 * Test case for message broadcast module.
 * Note: parts of this listing were truncated in the original and have been reconstructed.
 */
public class MessageBroadcastModuleTest {

    /**
     * Provider for user data.
     */
    private UserProvider userProvider;

    /**
     * Service for broadcasting messages.
     */
    private BroadcastService broadcastService;

    /**
     * Tested module.
     */
    private MessageBroadcastModule module;

    /**
     * Creates new instance.
     */
    public MessageBroadcastModuleTest() {
    }

    /**
     * Sets up test environment.
     */
    @Before
    public void setUp() {
        userProvider = mock(UserProvider.class);
        broadcastService = mock(BroadcastService.class);
        module = MessageBroadcastModule.create(userProvider, broadcastService);
    }

    /**
     * Tests message processing.
     */
    @Test
    public void testProcessMessage() {
        when(userProvider.getUserName("1")).thenReturn("user 1");

        module.processMessage("userId=1|message=Hello world|timestamp=2017/01/01 12:00:00|ip=127.0.0.1|city=Brno|age=12|occupation=student");
        verify(broadcastService).broadcastMessage("user 1", Dates.parse("2017/01/01 12:00:00", "yyyy/MM/dd HH:mm:ss"), "Hello world");

        try {
            module.processMessage("message=Hello world|timestamp=2017/01/01 12:00:00|ip=127.0.0.1|city=Brno|age=12|occupation=student");
            fail("exception expected");
        } catch (RuntimeException e) {
            assertTrue(e.getMessage(), e.getMessage().contains("wrong input message, user is missing"));
        }

        try {
            module.processMessage("userId=1|message=Hello world0|ip=127.0.0.1|city=Brno|age=12|occupation=student");
            fail("exception expected");
        } catch (RuntimeException e) {
            assertTrue(e.getMessage(), e.getMessage().contains("wrong input message, timestamp is missing"));
        }

        try {
            module.processMessage("userId=1|timestamp=2017/01/01 12:00:00|ip=127.0.0.1|city=Brno|age=12|occupation=student");
            fail("exception expected");
        } catch (RuntimeException e) {
            assertTrue(e.getMessage(), e.getMessage().contains("wrong input message, text is missing"));
        }

        verifyNoMoreInteractions(broadcastService);
    }

    @Override
    public String toString() {
        return ToStringBuilder.reflectionToString(this);
    }

}


In the unit test you can see one happy case and 3 cases where validation fails. In addition, you can see the invocation of verifyNoMoreInteractions to check that broadcastService didn’t broadcast any unwanted messages. That would be a harmful side effect. On the other hand, userProvider doesn’t need this protection, because it performs a read-only operation (assuming it is truly read-only and multiple reads don’t cause any harm). Regardless of your coding style, it is important to do similar work and make sure the code is logically correct before starting with any optimization.

## Test Application and Profiler

Now, when you have the code separated into an isolated unit with well-defined unit tests, you are ready to start optimizing the performance. Before writing anything, ask yourself a question: Is this a critical part of the application? If the answer is no, then it’s better not to optimize. Examples of critical parts are:

• Methods called hundreds of millions of times.
• Methods that process a lot of records.
• Methods aggregating data from third parties in (near) real-time.

If you evaluated your method as a critical part of the application, then it is time to write the test application. Typically, you can place the test application right next to the unit tests. Here is the example.

package com.enterprisemath.articles.unitperformance;

import com.enterprisemath.utils.Dates;
import com.enterprisemath.utils.Month;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import static org.mockito.Matchers.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

/**
 * Performance test application for message broadcast module.
 * Note: parts of this listing were truncated in the original and have been reconstructed.
 */
public class MessageBroadcastModulePerformanceTestApp {

    /**
     * Prevents construction.
     */
    private MessageBroadcastModulePerformanceTestApp() {
    }

    /**
     * Main method.
     *
     * @param args arguments
     */
    public static void main(String args[]) {

        //
        // set up
        System.out.println("Setting up and generating test data");

        UserProvider userProvider = mock(UserProvider.class);
        when(userProvider.getUserName(any(String.class))).thenAnswer(new Answer<String>() {
            public String answer(InvocationOnMock invocation) throws Throwable {
                String id = (String) invocation.getArguments()[0];
                return "user " + id;
            }
        });

        // broadcastService is implemented inline as a no-op, broadcasting is not the measured part
        BroadcastService broadcastService = new BroadcastService() {
            public void broadcastMessage(String userName, Date timestamp, String message) {
                // do nothing
            }
        };

        MessageBroadcastModule module = MessageBroadcastModule.create(userProvider, broadcastService);

        System.out.println("Generating test data");

        Date ts = Dates.createDate(2017, Month.JANUARY, 1);
        List<String> messages = new ArrayList<String>(1000000);
        for (int i = 0; i < 1000000; ++i) {
            int usid = i % 50;
            messages.add("userId=" + usid + "|message=Hello world|timestamp=" + Dates.format(ts, "yyyy/MM/dd HH:mm:ss") +
                    "|ip=127.0.0.1|city=Brno|age=12|occupation=student");
        }

        System.out.println("Set up completed and data generated");

        //
        // wait to give user chance to connect profiler
        System.out.println("Timeout to allow attach profiler");

        try {
            for (int i = 0; i < 20; ++i) {
                System.out.print(".");
                Thread.sleep(1000);
            }
            System.out.println("");
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        System.out.println("Finished waiting for profiler");

        //
        // test
        System.out.println("Started performance test");

        long startTime = System.currentTimeMillis();
        for (String msg : messages) {
            module.processMessage(msg);
        }
        long endTime = System.currentTimeMillis();

        System.out.println("Performance test finished");

        //
        // dump the result
        long duration = endTime - startTime;
        System.out.println("Num messages = " + messages.size());
        System.out.println("Duration = " + duration + "; referenceDuration = 14882");
        System.out.println("Duration / message = " + ((double) duration / messages.size()));
        System.out.println("JOB DONE!!!");
    }

}


Usually, this is just a small application containing four parts:

1. Environment setup and data generation (it is desired to exclude this from measurement).
2. A waiting period to allow the user to connect a profiler.
3. Test execution.
4. Result presentation.

As you can see, I have used Mockito to mock userProvider, while broadcastService is implemented inline. Both ways allow you to create a unit performance test without even having the real implementation of the dependent services. The difference between them, for this purpose, is that the mock version carries additional overhead. The right choice depends on the particular use case. That’s all about the setup.

## Starting Profiler

When you have your application ready, you can run it and attach the profiler during the prepared waiting period (attaching during that window is what keeps the measurement clean). In NetBeans, it is pretty easy. The application can be run by right-clicking the file and choosing the Run File option. The profiler is attached from the top menu bar via Profile > Attach Profiler; then choose CPU and click Attach. Finally, choose your application and click OK.

## Analyzing Results

If everything is done correctly, then the console should look similar to the following dump after the application finishes. Note the profiler attachment inside the waiting period (in the middle of the dots):

cd C:\projects\enterprisemath.com\articles\unitperformance; "JAVA_HOME=C:\\Program Files\\Java\\jdk1.8.0_25" cmd /c "\"\"C:\\Program Files\\NetBeans 8.0.1\\java\\maven\\bin\\mvn.bat\" -Dexec.args=\"-classpath %classpath com.enterprisemath.articles.unitperformance.MessageBroadcastModulePerformanceTestApp\" -Dexec.executable=\"C:\\Program Files\\Java\\jdk1.8.0_25\\bin\\java.exe\" -Dexec.classpathScope=test -Dmaven.ext.class.path=C:\\Users\\radek.hecl\\AppData\\Roaming\\NetBeans\\8.0.1\\maven-nblib\\netbeans-eventspy.jar org.codehaus.mojo:exec-maven-plugin:1.2.1:exec\""
Running NetBeans Compile On Save execution. Phase execution is skipped and output directories of dependency projects (with Compile on Save turned on) will be used instead of their jar artifacts.
Scanning for projects...

------------------------------------------------------------------------
Building unitperformance 1.0-SNAPSHOT
------------------------------------------------------------------------

--- exec-maven-plugin:1.2.1:exec (default-cli) @ unitperformance ---
Setting up and generating test data
Generating test data
Set up completed and data generated
Timeout to allow attach profiler
..............Profiler Agent: Waiting for connection on port 5140 (Protocol version: 15)
.Profiler Agent: Established connection with the tool
Profiler Agent: Local accelerated session
.....
Finished waiting for profiler
Started performance test
Performance test finished
Num messages = 1000000
Duration = 15530; referenceDuration = 14882
Duration / message = 0.01553
JOB DONE!!!
Profiler Agent: Connection with agent closed
Profiler Agent: Connection with agent closed
Profiler Agent: JNI OnLoad Initialized successfully
Profiler Agent: 250 classes cached.
Profiler Agent: 250 classes cached.
------------------------------------------------------------------------
BUILD SUCCESS
------------------------------------------------------------------------
Total time: 42.375s
Finished at: Wed May 10 00:32:11 JST 2017
Final Memory: 5M/123M
------------------------------------------------------------------------


From the console output you can read that the test took roughly 15 seconds. In addition to the console output, there is the profiler result, where it is possible to drill down and see how much time the program spent in each method. The point of this test is to see the details of the processMessage method. It is very clear that the majority of the time is taken by the getUserName method. In this case, it is caused by calling the mock class. For simplicity, let’s assume that it would look similar if the underlying implementation made a call to the database (in such a case, SQL would need to be sent to the database and the database would need to parse it, pull data, and return the result over some protocol, which would definitely take some time). So, as a resolution, let’s consider the method getUserName as the bottleneck to deal with.

## Bottleneck Optimization

As you probably know, the typical way to avoid expensive queries is some form of caching. Let’s try the most primitive one: using HashMap. Here’s how the optimized processMessage method looks:

...

/**
 * Cache for users.
 */
private Map<String, String> usersCache = new HashMap<String, String>();

...

public void processMessage(String message) {
    String[] parts = message.split("\\|");
    String userName = null;
    String txt = null;
    Date now = null;
    for (String part : parts) {
        String[] subs = part.split("=", 2);
        if (subs[0].equals("userId")) {
            if (usersCache.containsKey(subs[1])) {
                userName = usersCache.get(subs[1]);
            }
            else {
                userName = userProvider.getUserName(subs[1]);
                usersCache.put(subs[1], userName);
            }
        }
        else if (subs[0].equals("timestamp")) {
            now = Dates.parse(subs[1], "yyyy/MM/dd HH:mm:ss");
        }
        else if (subs[0].equals("message")) {
            txt = subs[1];
        }
    }

    ValidationUtils.guardNotNull(userName, "wrong input message, user is missing");
    ValidationUtils.guardNotNull(txt, "wrong input message, text is missing");
    ValidationUtils.guardNotNull(now, "wrong input message, timestamp is missing");

    broadcastService.broadcastMessage(userName, now, txt);
}


When you run the performance test program with this adjustment, the whole run takes around 4.4 seconds instead of the original 15 seconds (on the same machine). In the profiler result, the bottleneck now becomes the function for parsing dates from the string. This would be the next step for optimization, if required. Before closing, let me add a few notes.

• Using hash maps for caching is probably not what you want in most real cases.
• The optimization introduced a new branch of code which is not covered by the current unit test. A good practice is to revisit the unit test and get this case properly covered.
• Optimization generally makes code more complex and less readable. Therefore, focus first only on the parts of your application which are critical, use the profiler to find the real bottleneck within the units, and stop optimizing when performance is good enough for your case.
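Regarding the first note: if you still want a simple in-process cache, a slightly safer variation is to bound it. `LinkedHashMap` supports a basic LRU eviction policy out of the box. Here is a sketch (not part of the example project):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCacheDemo {

    // A LinkedHashMap in access order evicts the eldest entry once the
    // configured capacity is exceeded, giving a minimal LRU cache.
    public static <K, V> Map<K, V> createLruCache(final int capacity) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> cache = createLruCache(2);
        cache.put("1", "user 1");
        cache.put("2", "user 2");
        cache.get("1");           // touch "1" so "2" becomes the eldest entry
        cache.put("3", "user 3"); // exceeds capacity, evicts "2"
        System.out.println(cache.containsKey("2")); // false
    }
}
```

Note that this cache, like the plain HashMap, is not thread-safe; in a concurrent module you would need to synchronize access or use a dedicated caching library.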

## Summary

This article shows one way of performance tuning at the unit level. This type of optimization has the advantage that anyone can do it with only a laptop and a few basic tools, and everything can be set up within minutes. Therefore, this is a great first layer of performance tuning, which will save you a lot of time and money during the later stages.