AR By Hand – Part 5 – Video and OpenGL

Welcome to the last chapter of this AR series. The previous chapter ended by drawing a cube model on top of the marker paper. This chapter closes the whole series by showing you how to connect the whole system with OpenGL.

To start, let me say a few words about OpenGL. Although the end goal for this project is to render a simple cube, doing even that in OpenGL is not trivial. You still need knowledge of the graphics pipeline, shaders, and various buffer objects. Covering just this would be enough for its own tutorial series. Luckily, there is a great book called “Computer Graphics Programming in OpenGL with JAVA, Second Edition” by V. Scott Gordon. I would encourage you to read it if you want to learn OpenGL; I personally used this book to create this project. The only notable difference is that the JOGL library is now available in the Maven repository, which makes installation super easy.

The source code for this chapter starts in the CameraPoseJoglTestApp executable class.

Video Processing

This is the easy part. There is a Java library called xuggle-video from OpenIMAJ. Processing a video works like this: you open a reader, register listeners, and read packets, processing the events until the end of the stream.

As for the source code, there is a VideoUtils class which allows processing videos synchronously using lambda functions. I used it to produce the videos in the previous chapters. In addition, there is a VideoPlayer class which plays the video in a separate thread and lets you process the frames in a callback. This is the class used in the OpenGL application.


I am going to cover only how to fit the previously created matrices into the OpenGL ones. The general usage of OpenGL to draw shapes is not discussed here.

In the previous chapter, you learned about the 3×3 matrix \(K\) and the 3×4 matrix \(V\). When you start working with OpenGL, you will see that all the matrices are 4×4. What the hell?

Don’t worry, they are all related. A detailed article series covering this point, although a bit difficult to digest, was published by Kyle Simek back in 2012.

Be aware that it requires some effort to make things display correctly. There are many conventions along the way, and a mistake in just one sign will result in a weird image, or nothing displayed at all.

To recap: the result of the previous chapter was the internal matrix \(K\) and the external matrix \(V\). You could multiply them as \(P=KV\) to get the projection matrix. The full matrices are the following.

K=\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \\
V=[R|T]=\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}

Projection and Internal

The OpenGL projection matrix is related to the internal matrix \(K\). OpenGL Projection Matrix is a nice article which explains how the whole projection and clipping works. To give you a brief idea, let’s look at the following image (taken from that article).

The left image shows the space which is displayed at the end of the graphics pipeline. Anything outside of that cut pyramid is not displayed. The camera is in the origin, oriented towards negative Z. There are 2 parallel planes forming the top and bottom of the pyramid. They are called near and far and are defined by scalar values. The near plane is also the place where the pixels are projected. These are OpenGL parameters introduced for practical reasons, and you need to choose them.

The right image shows the mapping of the volume into normalized device coordinates (NDC). This allows things like clipping or depth buffers.

In summary, the OpenGL projection matrix does 2 things – perspective projection and mapping to NDC. This can be expressed as the following matrix multiplication.
Therefore, you need to create these 2 matrices, using the following parameters.

  • Values of the matrix \(K\) (\(f_x,\ f_y,\ s,\ c_x,\ c_y\))
  • near and far values – I have chosen 0.1 and 1000 respectively
  • width and height are the width and height in pixels of the original input image

With all the parameters, it’s possible to write down the matrices right away.

P_{persp}= \begin{bmatrix}
f_x & s & -c_x & 0 \\
0 & f_y & -c_y & 0 \\
0 & 0 & near + far & near * far \\
0 & 0 & -1 & 0
\end{bmatrix} \\
P_{NDC}= \begin{bmatrix}
\frac{-2}{width} & 0 & 0 & 1 \\
0 & \frac{2}{height} & 0 & -1 \\
0 & 0 & \frac{2}{near-far} & \frac{far+near}{near-far} \\
0 & 0 & 0 & 1
\end{bmatrix} \\
P_{GL}=P_{NDC}\,P_{persp}

The code for this is in the Ar class, method toGlProj.
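The composition above can be sketched in plain Java. This is a hypothetical standalone class, not the project's actual Ar.toGlProj; matrices are stored row-major in a double[16].

```java
// Sketch: builds the OpenGL projection matrix as P_ndc * P_persp
// from the intrinsic parameters and the chosen near/far values.
public class GlProjSketch {

    // 4x4 matrix multiply, row-major double[16]
    static double[] mul(double[] a, double[] b) {
        double[] r = new double[16];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                for (int k = 0; k < 4; k++)
                    r[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
        return r;
    }

    static double[] toGlProj(double fx, double fy, double s, double cx, double cy,
                             int width, int height, double near, double far) {
        double[] persp = {
            fx, s, -cx, 0,
            0, fy, -cy, 0,
            0, 0, near + far, near * far,
            0, 0, -1, 0
        };
        double[] ndc = {
            -2.0 / width, 0, 0, 1,
            0, 2.0 / height, 0, -1,
            0, 0, 2.0 / (near - far), (far + near) / (near - far),
            0, 0, 0, 1
        };
        return mul(ndc, persp);
    }

    public static void main(String[] args) {
        double[] p = toGlProj(800, 800, 0, 320, 240, 640, 480, 0.1, 1000);
        // the bottom row must stay [0, 0, -1, 0], so that w' = -z
        System.out.println(p[14]); // prints -1.0
    }
}
```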

View and External

The OpenGL view matrix \(V_{GL}\) is easy to construct: take the \(V\) matrix and add \([0,0,0,1]\) as the 4th row. Like this.

V_{GL}=\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
The code for this is in the Ar class, method toGlV.
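As a sketch (the class name and array layout are my own; only the method name follows the article):

```java
// Sketch: extends a 3x4 [R|T] matrix to the 4x4 OpenGL view matrix
// by appending the row [0, 0, 0, 1].
public class GlViewSketch {

    static double[][] toGlV(double[][] rt) { // rt is 3x4
        double[][] v = new double[4][4];
        for (int i = 0; i < 3; i++)
            System.arraycopy(rt[i], 0, v[i], 0, 4);
        v[3][3] = 1.0; // bottom row is [0, 0, 0, 1]
        return v;
    }

    public static void main(String[] args) {
        double[][] rt = {
            {1, 0, 0, 5},
            {0, 1, 0, 6},
            {0, 0, 1, 7}
        };
        double[][] v = toGlV(rt);
        System.out.println(v[3][0] + " " + v[3][3]); // prints 0.0 1.0
    }
}
```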

Everything Together

Finally, you can see and run everything through the CameraPoseJoglTestApp executable class. A few things to mention.

  • The video is rendered to a texture, which is then drawn to the screen as 2 triangles.
  • The video processing and the OpenGL loop need synchronization. Otherwise, it won’t work.
  • There are 2 sets of shader programs: one for the background video, one for the 3D world.
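The synchronization point can be as simple as a shared holder object. This hypothetical sketch shows the idea, not the project's actual code: the video thread publishes the latest frame, and the OpenGL loop picks it up when it is ready to draw.

```java
// Sketch: a minimal thread-safe "latest frame" holder between the
// video-decoding thread and the OpenGL render loop.
public class FrameHolder<T> {
    private T latest;

    // video thread side: overwrite with the newest frame
    public synchronized void publish(T frame) {
        latest = frame;
    }

    // GL loop side: take the newest frame, or null if nothing new arrived
    public synchronized T take() {
        T f = latest;
        latest = null;
        return f;
    }

    public static void main(String[] args) {
        FrameHolder<int[]> holder = new FrameHolder<>();
        holder.publish(new int[] {1, 2, 3}); // video thread side
        int[] f = holder.take();             // GL loop side
        System.out.println(f.length);        // prints 3
        System.out.println(holder.take());   // prints null (already consumed)
    }
}
```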

That’s it. Here is the video with the result.


And this is the end. Although there is still a whole lot of space for improvements, I hope you have enjoyed this series and learned something new. I would be more than happy if you post your comments, questions, suggestions for improvements, or ideas for other AR-related projects. Or just another topic you are struggling with. I would love to be helpful.

See you in something else (;

AR By Hand – Part 4 – Camera Pose

Welcome to part 4 of this AR series. In the previous chapter, you saw how homography makes it possible to draw into a projected planar surface. This chapter extends the previously calculated homography into a form which allows drawing 3D objects into the scene.

The program related to this chapter is CameraPoseVideoTestApp. You can download the whole project right here.

The structure here is the same as in the previous chapter. First, you will see the equations, and then a practical example at the end. Don’t be stressed by the number of parameters and variables. It’s not that difficult once it comes to coding.

Camera and Homography

The camera is a device which projects points from 3D space onto a 2D plane. For this project, I have chosen the classical pinhole camera model, without worrying about lens distortions. This model makes point projection as simple as a matrix-vector multiplication in homogeneous coordinates (arrows on top of the lowercase letters symbolize vectors).

\vec{p_{2D}}=P\vec{p_{3D}} \\
\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix} =
\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix}

\(P\) is called a projection matrix and has 3 rows and 4 columns. This realizes the dimension drop onto the projection plane.
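As a quick illustration of that dimension drop, here is a hypothetical sketch that projects a 3D point with a 3×4 matrix and divides by \(w\) to get back cartesian 2D coordinates:

```java
// Sketch: homogeneous projection of a 3D point by a 3x4 matrix P,
// followed by the division by w.
public class ProjectSketch {

    static double[] project(double[][] p, double x, double y, double z) {
        double[] hom = new double[3];
        double[] in = {x, y, z, 1}; // homogeneous 3D point
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 4; j++)
                hom[i] += p[i][j] * in[j];
        // divide by w to leave homogeneous coordinates
        return new double[] {hom[0] / hom[2], hom[1] / hom[2]};
    }

    public static void main(String[] args) {
        // a trivial projection: drops z, keeps w = 1
        double[][] p = {
            {1, 0, 0, 0},
            {0, 1, 0, 0},
            {0, 0, 0, 1}
        };
        double[] px = project(p, 3, 4, 42);
        System.out.println(px[0] + " " + px[1]); // prints 3.0 4.0
    }
}
```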

The camera is a physical product with certain properties, most notably its focal length. What is important is that these properties are constant for a given camera (assuming you are not zooming). This is the internal set of properties. Then there is an external set of properties, which is the position and orientation of the camera.

In matrix language, this is reflected by decomposing the matrix \(P\) into a 3×3 calibration matrix \(K\) (internal matrix with camera properties) and a 3×4 view matrix \(V\) (external matrix with camera position and rotation). These matrices are sometimes called intrinsic and extrinsic. You can drill them down into the following form.

K=\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \\
V=[R|T]=\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}

  • \(f_x\) and \(f_y\) are focal lengths in the respective axes.
  • \(s\) is a skew factor.
  • \(c_x\) and \(c_y\) are the coordinates of the principal point of the camera.
  • \(R\) is a camera rotation matrix. \(R_1,R_2,R_3\) are columns of the rotation matrix, and \(r_{ab}\) are the elements. The rotation matrix is orthonormal (unit vectors, and orthogonal to each other). Remember this one, because it will be discussed later.
  • \(T\) is a camera translations vector with elements \(t_x,t_y,t_z\).

Calibration Matrix

All the elements of matrix \(K\) are properties of the camera. One way to get them is to make a proper measurement. If you want to do that, OpenCV contains plenty of material on camera calibration. I just picked the values manually as follows.

  • \(f_x,f_y=400\ or\ 800\)
  • \(s=0\)
  • \(c_x,c_y=\) center of the input image (for 640×480 image, these will be 320 and 240)
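Assembling \(K\) from these hand-picked values is then trivial. The class below is a hypothetical sketch, not the project's code:

```java
// Sketch: building the calibration matrix K from the manually chosen
// values, here for a 640x480 input image with fx = fy = 800.
public class CalibrationSketch {

    static double[][] buildK(double f, double s, double cx, double cy) {
        return new double[][] {
            {f, s, cx},
            {0, f, cy},
            {0, 0, 1}
        };
    }

    public static void main(String[] args) {
        double[][] k = buildK(800, 0, 640 / 2.0, 480 / 2.0);
        System.out.println(k[0][2] + " " + k[1][2]); // prints 320.0 240.0
    }
}
```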

Relation with Homography

To show you how camera pose and homography are related, let’s start by writing down the equations for point projection.
\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix} =
\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix} = \\
K[R|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix} = \\
K[R_1|R_2|R_3|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix} = \\
\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix}

If \(z_{3D}=0\), then the equations look like this.
\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix} =
\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix} = \\
K[R|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix} = \\
K[R_1|R_2|R_3|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix} = \\
\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix}

Then you can carry out the matrix multiplication to figure out that you can drop the third column of the rotation matrix and the z coordinate of the 3D point and get the same results (reminder: you can do this only if \(z_{3D}=0\), otherwise it won’t work). This gives you the following.
\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix} =
\begin{bmatrix} p_{11} & p_{12} & p_{14} \\ p_{21} & p_{22} & p_{24} \\ p_{31} & p_{32} & p_{34} \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ 1 \end{bmatrix} = \\
K[R_1|R_2|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ 1 \end{bmatrix} = \\
\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{bmatrix}
\begin{bmatrix} x_{3D} \\ y_{3D} \\ 1 \end{bmatrix}

Now note that \([R_1|R_2|T]\) is a 3×3 matrix, and at the same time you can consider \(K[R_1|R_2|T]=H\) from the previous chapter. That’s how homography is related to the camera projection. And that’s also why you can project points onto the \(z=0\) plane without worrying about the camera internal parameters at all.

Extending Homography

Now on to the full camera pose. The easiest way seems to be to calculate \([R_1|R_2|T]=K^{-1}H\), then make \(R_3=R_1\times R_2\), and have the full \([R|T]\) matrix.

Unfortunately, this doesn’t work. Remember, a little bit above I mentioned that matrix \(R\) is orthonormal? \(K\) and \(H\) already come out of estimations, carrying errors, so it’s not guaranteed that \(R_1\) and \(R_2\) obtained in this way are orthonormal. That would make the final image look weird. Therefore, let’s make them orthonormal.

The implementation of the following text is available inside the Ar class, method estimateMvMatrix. Here I would like to refer you to the “Augmented Reality with Python and OpenCV” article written by Juan Gallostra. This is where I first discovered the method which I am going to describe.

Let’s start by constructing \([G_1|G_2|G_3]=K^{-1}H\). In the implementation, you will also see that I negate the homography matrix before plugging it into the equation. That’s because a real pinhole camera would project a flipped image, but there is no flipping here.

Now \([G_1|G_2|G_3]\) is close to the desired \([R_1|R_2|T]\), but not equal, because it is still an estimation. Therefore \([G_1|G_2|G_3]\) is only nearly orthonormal. Then you can write.

l=\sqrt{\| G_1 \| \| G_2 \|} ,\ \
G_1'=\frac{G_1}{l} ,\ \
G_2'=\frac{G_2}{l} ,\ \
G_3'=\frac{G_3}{l} \\
\vec{c}=G_1' + G_2' ,\ \
\vec{p}=G_1' \times G_2' ,\ \
\vec{d}=\vec{c} \times \vec{p} \\
R_1=\frac{1}{\sqrt{2}}\left( \frac{\vec{c}}{\| \vec{c} \| } + \frac{\vec{d}}{\| \vec{d} \| } \right) ,\ \
R_2=\frac{1}{\sqrt{2}}\left( \frac{\vec{c}}{\| \vec{c} \| } - \frac{\vec{d}}{\| \vec{d} \| } \right) \\
R_3=R_1 \times R_2 ,\ \
T=G_3'

Then you can stack the vectors as columns to get the final \(V=[R_1|R_2|R_3|T]\) 3×4 matrix. Finally, compute \(P=KV\) and start projecting points.
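The whole orthonormalization step can be sketched as follows. This is a standalone illustration with my own class and method names; the project's actual implementation is in Ar.estimateMvMatrix.

```java
// Sketch: given the columns G1, G2, G3 of K^-1 * H, produce
// orthonormal R1, R2 (and R3 = R1 x R2) plus the translation T.
public class RotationSketch {
    static double norm(double[] v) { return Math.sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]); }
    static double[] scale(double[] v, double k) { return new double[] {v[0]*k, v[1]*k, v[2]*k}; }
    static double[] add(double[] a, double[] b) { return new double[] {a[0]+b[0], a[1]+b[1], a[2]+b[2]}; }
    static double[] sub(double[] a, double[] b) { return new double[] {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }
    static double[] cross(double[] a, double[] b) {
        return new double[] {
            a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]
        };
    }

    // returns {R1, R2, R3, T}
    static double[][] orthonormalize(double[] g1, double[] g2, double[] g3) {
        double l = Math.sqrt(norm(g1) * norm(g2));
        double[] g1n = scale(g1, 1 / l), g2n = scale(g2, 1 / l), t = scale(g3, 1 / l);
        double[] c = add(g1n, g2n);          // "average" direction
        double[] p = cross(g1n, g2n);        // normal of the G1', G2' plane
        double[] d = cross(c, p);            // orthogonal to c, in the same plane
        double s = 1 / Math.sqrt(2);
        double[] r1 = scale(add(scale(c, 1 / norm(c)), scale(d, 1 / norm(d))), s);
        double[] r2 = scale(sub(scale(c, 1 / norm(c)), scale(d, 1 / norm(d))), s);
        double[] r3 = cross(r1, r2);
        return new double[][] {r1, r2, r3, t};
    }

    public static void main(String[] args) {
        // slightly non-orthonormal input columns
        double[][] r = orthonormalize(
            new double[] {1.0, 0.1, 0.0},
            new double[] {0.0, 1.0, 0.1},
            new double[] {1.0, 2.0, 3.0});
        // R1 and R2 come out unit-length and orthogonal
        double dot = r[0][0]*r[1][0] + r[0][1]*r[1][1] + r[0][2]*r[1][2];
        System.out.println(Math.abs(dot) < 1e-9);            // prints true
        System.out.println(Math.abs(norm(r[0]) - 1) < 1e-9); // prints true
    }
}
```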


Now you know how to draw 3D objects into the scene. So far, all the drawing is done through simple image operations, which is useful only for basic demos. In the last chapter, you will discover how to hook the whole thing up with video and OpenGL to make more funky stuff.

AR By Hand – Part 3 – Homography

Welcome to part 3 of this AR series. In the previous chapter, you read about how to detect and track a white A4 paper. The result was 4 points in the image corresponding to the corners. This chapter uses these points to build a homography. That’s the next step towards the AR experience.
This article first goes through the mathematics behind homography, and then shows the use case relevant to this project.


Here, I will briefly review the terms and then derive the equation system for homography. If you follow this, you will see why you need to detect at least 4 points which do not lie on a single line.


Correspondence. Imagine you have 2 photos of the same object from slightly different positions. Then points \(p_1\) and \(p_2\) on the respective images are corresponding if they are projections of the same physical point. The sign for correspondence is \(p_1\ \hat{=}\ p_2\).

Homogeneous coordinates. This is a coordinate system used in projective geometry and will be used here from now on as well. 2D vector \(\begin{bmatrix} x\\ y \end{bmatrix}\) in cartesian coordinates is expressed as 3D vector \(\begin{bmatrix} wx\\ wy\\ w \end{bmatrix}, \forall w\neq 0\) in homogeneous coordinates. Similarly, 3D vectors in cartesian coordinates are 4D vectors in homogeneous coordinates. Also, note that \(\begin{bmatrix} w_{1}x\\ w_{1}y\\ w_1 \end{bmatrix}=\begin{bmatrix} w_{2}x\\ w_{2}y\\ w_2 \end{bmatrix}, \forall w_1\neq 0,\ w_2\neq 0\) in homogeneous coordinates.

Matrices are used to represent certain geometric transformations in homogeneous coordinates. The transformation of a point \(p\) is realized by a simple multiplication, so that \(p’=Mp\). In addition, transformations can be merged into a single one by standard matrix multiplication.

Homography Equations

Mr. Wikipedia says that any two images of the same planar surface in space are related by a homography (assuming a pinhole camera model).

In other words, if \(I\) and \(I’\) are 2 images containing the same planar surface, then there exists a 3×3 matrix \(H\) which maps points \(p\) into corresponding points \(p’\), such that \(p’=Hp\). Remember that these points must lie on that plane.

Let’s write down the equations in more detail.

H=\begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \\
\begin{bmatrix} w’x’ \\ w’y’ \\ w’ \end{bmatrix}=\begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}\begin{bmatrix} wx \\ wy \\ w \end{bmatrix} \\
\begin{bmatrix} w’x’ \\ w’y’ \\ w’ \end{bmatrix}=\begin{bmatrix} h_{11}wx + h_{12}wy + h_{13}w \\ h_{21}wx + h_{22}wy + h_{23}w \\ h_{31}wx + h_{32}wy + h_{33}w \end{bmatrix}

The goal is to figure out the 9 elements of matrix \(H\). Without losing any generality, you can assume that \(w = 1\) and switch to cartesian coordinates by division. This gives the following equation.

\begin{bmatrix} x’ \\ y’ \end{bmatrix}=\begin{bmatrix} \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}} \\ \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}} \end{bmatrix}

This equation system has 9 degrees of freedom. Luckily, you can multiply all elements of \(H\) by a non-zero \(k\) without affecting the solution at all. This removes 1 degree of freedom and opens 2 possible ways to a solution.

The first way is to set \(h_{33} = 1\). You can do this as long as \(h_{33}\neq 0\). The second, more general way is to impose a unit vector constraint such that \(h_{11}^2+h_{12}^2+h_{13}^2+h_{21}^2+h_{22}^2+h_{23}^2+h_{31}^2+h_{32}^2+h_{33}^2=1\). Here I will use the first way because it seems more intuitive and better supported by the numerical libraries.

Homography Solution

Setting \(h_{33}=1\) gives the following.

\begin{bmatrix} x’ \\ y’ \end{bmatrix}=\begin{bmatrix} \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + 1} \\ \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + 1} \end{bmatrix}

After separating the components, multiplying, and reorganizing, you get these 2 equations.

x’=h_{11}x + h_{12}y + h_{13} – h_{31}xx’ – h_{32}yx’ \\
y’=h_{21}x + h_{22}y + h_{23} – h_{31}xy’ – h_{32}yy’

These are linear equations with 8 unknowns. Therefore, in theory, 8 equations are required (with certain preconditions to make sure the system is not degenerate) to be able to figure out the unknowns.

In practice, we have 4 estimated corner points of the marker paper. Although these points carry some errors from the image processing part, they do not lie on a single line. Therefore it is possible to plug them into the equations and use numerical methods to get an approximate solution with minimal error. This is how the equations look.

\begin{bmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x_1' & -y_1x_1' \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y_1' & -y_1y_1' \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2x_2' & -y_2x_2' \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2y_2' & -y_2y_2' \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots
\end{bmatrix}
\begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \end{bmatrix} =
\begin{bmatrix} x_1' \\ y_1' \\ x_2' \\ y_2' \\ \cdots \end{bmatrix}

I won’t go into the numerics here. I just use the solver provided by the mathematics library to get a solution like this one.

H=\begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \\

The source code for homography estimation is located inside the Ar class, method estimateHomography.
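For illustration, here is a hypothetical self-contained sketch that builds the system from exactly 4 correspondences and solves it with plain Gaussian elimination. The project delegates the solving to a mathematics library instead.

```java
// Sketch: estimate the 8 unknown elements of H (with h33 = 1) from
// 4 point correspondences by solving the 8x8 linear system.
public class HomographySketch {

    // Gaussian elimination with partial pivoting; solves a*x = b in place
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            double[] tmp = a[col]; a[col] = a[piv]; a[piv] = tmp;
            double tb = b[col]; b[col] = b[piv]; b[piv] = tb;
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            x[r] = b[r];
            for (int c = r + 1; c < n; c++) x[r] -= a[r][c] * x[c];
            x[r] /= a[r][r];
        }
        return x;
    }

    // src and dst are 4 corresponding points {x, y}; returns h11..h32
    static double[] estimateHomography(double[][] src, double[][] dst) {
        double[][] a = new double[8][];
        double[] b = new double[8];
        for (int i = 0; i < 4; i++) {
            double x = src[i][0], y = src[i][1], xp = dst[i][0], yp = dst[i][1];
            a[2 * i]     = new double[] {x, y, 1, 0, 0, 0, -x * xp, -y * xp};
            a[2 * i + 1] = new double[] {0, 0, 0, x, y, 1, -x * yp, -y * yp};
            b[2 * i] = xp;
            b[2 * i + 1] = yp;
        }
        return solve(a, b);
    }

    public static void main(String[] args) {
        // the unit square mapped by H = [[2,0,10],[0,2,20],[0,0,1]]
        double[][] src = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
        double[][] dst = {{10, 20}, {12, 20}, {10, 22}, {12, 22}};
        double[] h = estimateHomography(src, dst);
        System.out.println(Math.round(h[0]) + " " + Math.round(h[2])); // prints 2 10
    }
}
```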

Use Case

Homography has several use cases; you can easily find them on the internet. Here is just the one relevant to this project. Let’s estimate the homography in such a way that the detected rectangle corresponds to a fixed rectangle. Then draw a blue square into the fixed rectangle and the corresponding square into the original image. The result is right below.


This chapter covered homography, which allows you to draw into the planar surfaces of an image. In the next chapter, you will learn how to extend homography to get the projection matrix and be able to draw 3D objects lying on top of that plane.

AR By Hand – Part 2 – Plane Detection

Welcome back to the AR series. The previous chapter introduced the whole project. This chapter covers the first topic on the list: plane detection and tracking. At the end of this article, you will be able to identify the corner points of the A4 paper in the input image, in the right order, so you can draw a rectangle contour there. An example outcome of this chapter is in the following image.

The full implementation is available in the accompanying project. The download button is right below this paragraph. The main class for this article is WhiteMarkerTracker. I would encourage you to look into the source code while reading the article. Now let’s get into it.


Pre-processing is the first phase of image processing. The goal of pre-processing is to clean up the image and extract only the information usable in the next phase. Since, in many cases, this phase has to go through every pixel of the image, only relatively simple operations can be performed. The result is a list of “features” (you will see what this word means in a little bit) useful for more detailed processing. And many times, it’s desirable to have a much smaller number of features than pixels.

In our case, the goal is to identify the white rectangle. A good way to start is by identifying the pixels lying on the edge of the white area. Those pixels are the “features” in this particular context. They can be extracted by the following process.

  1. Threshold pixels by color (a white pixel must have its red, green and blue components high enough).
  2. Identify all the connected areas (blobs).
  3. Pick the biggest blob (assuming the marker paper is the dominant white object in the image) and throw away all the others. This cleans up certain artifacts.
  4. If the biggest blob doesn’t have enough volume (i.e., number of pixels), then exit.
  5. Identify the contour points of the blob. These are the white pixels next to a black pixel.
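Steps 1 and 5 can be sketched as follows. This is hypothetical code, not the project's; blob labelling (steps 2–4) is omitted to keep the sketch short.

```java
// Sketch: threshold "white enough" pixels, then mark contour pixels,
// i.e. white pixels with at least one non-white 4-neighbour.
public class ContourSketch {

    // step 1: all three RGB components must reach the threshold
    static boolean isWhite(int rgb, int thres) {
        int r = (rgb >> 16) & 0xff, g = (rgb >> 8) & 0xff, b = rgb & 0xff;
        return r >= thres && g >= thres && b >= thres;
    }

    // step 5: contour = white pixel touching a non-white pixel (or border)
    static boolean[][] contour(boolean[][] mask) {
        int h = mask.length, w = mask[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                if (!mask[y][x]) continue;
                out[y][x] = y == 0 || y == h - 1 || x == 0 || x == w - 1
                        || !mask[y - 1][x] || !mask[y + 1][x]
                        || !mask[y][x - 1] || !mask[y][x + 1];
            }
        return out;
    }

    public static void main(String[] args) {
        // 5x5 mask with a filled 3x3 white square in the middle
        boolean[][] mask = new boolean[5][5];
        for (int y = 1; y <= 3; y++)
            for (int x = 1; x <= 3; x++) mask[y][x] = true;
        boolean[][] c = contour(mask);
        System.out.println(c[1][1] + " " + c[2][2]); // prints true false
    }
}
```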

The result is a set of contour points, illustrated in the image below. To give you rough numbers: the input image has 640×480 pixels, which is slightly over 300,000 pixels in total. Given reasonable input, pre-processing chooses fewer than 3,000 of them. This reduces the amount of data by a factor of 100.

Let me add one more note. Depending on how you acquire the input image, you might need to apply additional operations like Gaussian smoothing to get a reasonable contour. I have used an image from a video sequence, where the compression algorithm already did a similar job, therefore it wasn’t necessary in my case.

Once the contour pixels are selected, the next phase can start.


Now you have a set of contour pixels. So what to do with them? Although you can see they mostly form the edges of the marker paper, there are still some errors. In addition, they are just pixels, which doesn’t tell you the positions of the corner points. And in some cases, corner point locations might not be that well defined. Like in the following image.

The detection algorithm first identifies the 4 edge lines and then calculates the corner positions from their intersections. Note that there are several points along the way where the algorithm can fail to detect something. In such a case, it reports that nothing was detected.

Edge Lines

A good method for identifying edge lines in the presence of errors and outliers is RANSAC (Random Sample Consensus). The general RANSAC workflow is the following.

  1. Repeat N times (N is up to you)
    1. Randomly choose the minimal number of points you need to build the model
    2. Build the model from the chosen points
    3. Compare the model with the other sample points and calculate the number of good fits (inliers, or points which are close enough to the expected positions)
    4. If the number of inliers is high enough, then accept the model and terminate the cycle
  2. You either have a model, or there is a high probability that no model exists

Now more concretely for the edges of the marker paper. The main difference is that we want to find 4 lines lying as much as possible over the contour points. For this, it is necessary to choose the maximal number of iterations we are willing to take (N), the minimum number of points lying “close enough” to a line in order to accept it (minAccept – a good choice is a percentage of the total number of sample points), and the distance from the line which is considered “close enough” (dstThres). The full algorithm is in the Fitting class, method linesRansac. Here is just a brief description.

  1. Start with an empty result list
  2. Repeat at most N times, stop if the result list has the desired number of lines (4 in this case)
    1. Pick 2 different random points from the sample set
    2. Build the line
    3. Calculate the number of inliers (max distance from the line is dstThres)
    4. If the number of inliers is greater than or equal to the minAccept parameter, then
      1. Add the line to the result list
      2. Remove the inliers from the sample set (to prevent the same line from being detected again)
  3. Return the result list
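The core of the loop can be sketched for a single line like this. This is a hypothetical standalone class with the parameter names from the text; the full multi-line version lives in Fitting.linesRansac.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: RANSAC for a single line over a set of 2D points.
public class RansacSketch {

    // line through two points as (a, b, c) with a*x + b*y + c = 0, ||(a, b)|| = 1
    static double[] line(double[] p, double[] q) {
        double a = p[1] - q[1], b = q[0] - p[0];
        double n = Math.sqrt(a * a + b * b);
        return new double[] {a / n, b / n, -(a * p[0] + b * p[1]) / n};
    }

    // point-line distance (normals are unit length, so no division needed)
    static double dist(double[] l, double[] p) {
        return Math.abs(l[0] * p[0] + l[1] * p[1] + l[2]);
    }

    static double[] ransacLine(List<double[]> pts, int n, int minAccept,
                               double dstThres, Random rnd) {
        for (int it = 0; it < n; it++) {
            double[] p = pts.get(rnd.nextInt(pts.size()));
            double[] q = pts.get(rnd.nextInt(pts.size()));
            if (p == q) continue;                 // need 2 different points
            double[] l = line(p, q);
            int inliers = 0;
            for (double[] s : pts)
                if (dist(l, s) <= dstThres) inliers++;
            if (inliers >= minAccept) return l;   // model accepted
        }
        return null; // high probability that no line exists in the data
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        for (int i = 0; i < 50; i++) pts.add(new double[] {i, 2 * i + 1}); // y = 2x+1
        pts.add(new double[] {10, 100}); // one outlier
        double[] l = ransacLine(pts, 100, 45, 0.5, new Random(42));
        // virtually certain to find the line with these parameters
        System.out.println(l != null && dist(l, new double[] {5, 11}) < 0.5);
    }
}
```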

If you run this algorithm, then you will “ideally” (I will get back to the word “ideally” later in this chapter) end up with lines like in this image.

As you can see, RANSAC is tolerant of various forms of distraction. All you need is a sufficient number of sample points being “close enough”. Now that the edge lines are known, the final shape can be extracted.

Paper Shape

Going from edges to corner points is a matter of identifying which lines are perpendicular, calculating the intersections, and ordering them counterclockwise. The full implementation is in the WhiteMarkerTracker class, method extractPoints.

Identifying the perpendicular lines is possible by examining angles, because we know that the rectangle has 2 sets of 2 parallel lines. If you select any edge line, then the parallel line will always have the smallest angle in between. The 2 perpendicular lines are the remaining lines which are not the parallel one. The angle between the lines can be calculated from the line equations, and the same goes for the intersection. The ordering of the points just requires a little bit of vector algebra and messing around.
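Both computations can be sketched with lines in the normalized form \(ax+by+c=0\). This is a hypothetical helper class, not the project's code:

```java
// Sketch: angle between two lines and their intersection, with lines
// given as (a, b, c) where a*x + b*y + c = 0 and ||(a, b)|| = 1.
public class LineGeometrySketch {

    // angle between the lines, in [0, PI/2]; uses the unit normals
    static double angle(double[] l1, double[] l2) {
        double dot = Math.abs(l1[0] * l2[0] + l1[1] * l2[1]);
        return Math.acos(Math.min(1, dot));
    }

    // intersection of two non-parallel lines (Cramer's rule)
    static double[] intersection(double[] l1, double[] l2) {
        double det = l1[0] * l2[1] - l1[1] * l2[0];
        double x = (-l1[2] * l2[1] + l2[2] * l1[1]) / det;
        double y = (-l1[0] * l2[2] + l2[0] * l1[2]) / det;
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        double[] vertical = {1, 0, -2};   // x = 2
        double[] horizontal = {0, 1, -3}; // y = 3
        System.out.println(Math.abs(angle(vertical, horizontal) - Math.PI / 2) < 1e-9); // prints true
        double[] p = intersection(vertical, horizontal);
        System.out.println(p[0] + " " + p[1]); // prints 2.0 3.0
    }
}
```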

If everything is done, then you should be able to see an image like this one.

So, are we done? Not so fast…

RANSAC Problems

Remember, before, I used the word “ideally”? This part is all about that.

Let’s make a little experiment. Take the example image from this chapter and make a 200-frame video out of it. In every frame, perform the plane detection as described so far, follow up by estimating the AR parameters, and draw a 3D cube on top of the plane (don’t worry if you don’t know how to do this yet). This is how the result would look.

The cube is shaking, although it should stay still. In addition, there are several frames which are completely wrong. This is caused by the nature of the RANSAC method. RANSAC is a probabilistic method which randomly selects points to create a model. This has two consequences.

  1. Every estimation is slightly different, although most of them are reasonably good. This is the reason for the shaking, especially because the errors sum up along the way.
  2. There is a small chance that some model is wrong yet still fits enough points to be accepted. This is the reason for several frames being totally wrong.

To be more illustrative, let’s see how the current line detection looks.

In this video, you can clearly see that 2 RANSAC detections of the same line might both be reasonably good, yet slightly different. This is the root cause of the shaking cube in the previous video. Also, from time to time you can see a misdetection causing a single edge to be detected twice. This is the root cause of the cube being rendered in the wrong position.

How to improve that?

Stabilization With Tracking

Although the simple RANSAC method isn’t good enough to produce a good result, it’s still a reasonable start. Therefore, let’s use the initial detection and enhance it.

There are 2 enhancements (both are implemented inside WhiteMarkerTracker class, method trackMarker).

  1. Track previously detected shape
  2. Smooth the estimation by averaging

First, let’s discuss the tracking. Tracking is done frame by frame. In each frame, the algorithm knows the parameters of the old object (4 corner points in this case) and the new observations (contour points in this case). The result is either an updated parameter set or a report that the object has been lost.

This implementation works by tracking the edges one by one and then putting them together, assuming that the edges don’t move significantly between 2 frames. The corner points are used to define an area where each edge is expected to be. This reduces the number of contour points and therefore allows requiring a higher percentage of inliers for the RANSAC estimation. Like in the image below.

Now regarding the smoothing. Smoothing is normally achieved by averaging. Therefore, for every tracked line, let RANSAC estimate M good fits rather than just one. Then take the set of points where each is close enough to at least one of these good fits, and make the final line as a least-squares fit over that set of points.
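The final least-squares fit can be sketched like this. It is hypothetical code fitting \(y=mx+b\); a real implementation, including the project's, also has to handle near-vertical lines, which this form cannot represent.

```java
// Sketch: ordinary least-squares fit of a line y = m*x + b
// through the pooled inlier points.
public class LeastSquaresSketch {

    // returns {m, b} minimizing the sum of squared vertical errors
    static double[] fit(double[][] pts) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        int n = pts.length;
        for (double[] p : pts) {
            sx += p[0]; sy += p[1];
            sxx += p[0] * p[0]; sxy += p[0] * p[1];
        }
        double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - m * sx) / n;
        return new double[] {m, b};
    }

    public static void main(String[] args) {
        // noisy samples around y = 2x + 1; the errors average out
        double[][] pts = {{0, 1.1}, {1, 2.9}, {2, 5.1}, {3, 6.9}};
        double[] mb = fit(pts);
        System.out.println(Math.abs(mb[0] - 2) < 0.1 && Math.abs(mb[1] - 1) < 0.2); // prints true
    }
}
```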

When you put everything together, the result looks like this video.


This chapter explained how to track an A4 marker paper in a video. The next chapter will use the outcome to build up a homography.

AR By Hand – Part 1 – Introduction

Chances are, you have already at least heard the term augmented reality. A very generic definition says that augmented reality is the real world enhanced by computer-generated information. This mini-series is about enhancing a captured image or video by adding a 3D model into it. Like in this video.

If you want to know how this works, then keep reading. I will walk you through all the steps. The following image is the rough summarization of the steps.

Now in simple words. The final goal is to display a cube on top of a white A4 paper. The process starts by thresholding the pixels to find the white ones. This produces several blobs; the A4 paper is assumed to be the biggest of them, and the others are suppressed. The next step is to identify the contour, edge lines, and corners of the paper. Then the planar transformation is computed and used to obtain a projection matrix which allows drawing 3D objects into the original image. Finally, the projection matrix is transformed into an OpenGL-compatible form. Then you can do anything with that.

Sounds trivial, right? Maybe. Still, it takes some effort to go through the details, therefore I have prepared the following chapters.

In addition, there is an example project accompanying this series. You can download it right here.

The code is written in Java (+ a few OpenGL shaders) and built with Maven. As long as you are familiar with these, you should be able to build the project and run the test applications. The test applications are executable classes within the test sources. The main classes are CameraPoseVideoTestApp and CameraPoseJoglTestApp.

Regarding the expected level of knowledge: it will be very helpful if you know some linear algebra, homogeneous coordinates, RGB image representation, the pinhole camera model, and perspective projection. Although I will try to keep the required level to the minimum, it is too much to explain every little thing in detail.
Now let me make a note about the quality of the result. There are 2 main factors which affect quality – implementation and environment. I will cover one type of implementation and let you judge how good it is. Please leave me comments, especially if you have a concrete idea for improvement. The second factor is the environment. This includes everything from camera quality, noise, distractions in the scene, lighting, and occlusion, to the time you can spend on processing each frame. Even today’s state-of-the-art algorithms will fail in a crappy environment. Please keep this in mind when you do your own experiments.


This chapter gave you an overall idea of the project. The next chapter will tell you how to track the plane.