Welcome to part 4 of this AR series. In the previous chapter, you saw how homography makes it possible to draw onto a projected planar surface. This chapter extends the previously calculated homography into a form that allows drawing 3D objects into the scene.

The program related to this chapter is CameraPoseVideoTestApp. You can download the whole project right here.

The structure here is the same as in the previous chapter. First, you will see the equations, and then the practical example at the end. Don’t stress about the number of parameters and variables. It’s not that difficult once it comes to coding.

## Camera and Homography

A camera is a device that projects points from 3D space onto a 2D plane. For this project, I have chosen the classical pinhole camera model, without worrying about lens distortion. This model makes point projection as simple as a matrix-vector multiplication in homogeneous coordinates (arrows above lowercase letters denote vectors).

\[

\vec{p_{2D}}=P\vec{p_{3D}} \\

\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix}

=

\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix}

\]

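\(P\) is called the projection matrix. It has 3 rows and 4 columns, which is what drops one dimension when mapping 3D points onto the projection plane. To get the actual 2D coordinates, divide the first two components of the result by \(w\).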
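The projection above can be sketched in a few lines of plain Python. This is only an illustration: the function name `project` and the values in `P` are made up for the example and are not taken from the project code.

```python
def project(P, point3d):
    """Project a 3D point with a 3x4 projection matrix P (a list of 3 rows)."""
    x, y, z = point3d
    homogeneous = (x, y, z, 1.0)
    # The matrix-vector product yields (w * x2d, w * y2d, w).
    wx, wy, w = (sum(p * c for p, c in zip(row, homogeneous)) for row in P)
    if w == 0:
        raise ValueError("point projects to infinity (w == 0)")
    # Dividing by w converts homogeneous coordinates to pixel coordinates.
    return wx / w, wy / w


# Hypothetical projection matrix: focal length 400, principal point (320, 240),
# no rotation and no translation.
P = [
    [400.0, 0.0, 320.0, 0.0],
    [0.0, 400.0, 240.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

print(project(P, (1.0, 0.5, 4.0)))  # (420.0, 290.0)
```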

A camera has some properties of its own, most notably its focal length. What matters is that these properties are constant for a given camera (assuming you are not zooming). This is the internal set of properties. Then there is the external set of properties: the position and orientation of the camera.

In matrix language, this corresponds to decomposing the matrix \(P\) into a 3×3 calibration matrix \(K\) (the internal matrix with camera properties) and a 3×4 view matrix \(V\) (the external matrix with camera position and rotation). These matrices are also called the intrinsic and extrinsic matrices. Expanded, they have the following form.

\[

P=KV=K[R|T]=K[R_1|R_2|R_3|T]=

\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}

\]

- \(f_x\) and \(f_y\) are focal lengths in the respective axes.
- \(s\) is a skew factor.
- \(c_x\) and \(c_y\) are the coordinates of the principal point of the camera.
- \(R\) is the camera rotation matrix. \(R_1,R_2,R_3\) are the columns of the rotation matrix, and \(r_{ab}\) are its elements. The rotation matrix is **orthonormal** (its columns are unit vectors, orthogonal to each other). Remember this one, because it will be discussed later.
- \(T\) is the camera translation vector with elements \(t_x,t_y,t_z\).

### Calibration Matrix

All the elements of matrix \(K\) are properties of the camera. One way to get them is to measure them through a proper calibration. If you want to do that, OpenCV offers plenty of material on the topic. I simply picked the values by hand as follows.

- \(f_x=f_y=400\) or \(800\)
- \(s=0\)
- \(c_x,c_y=\) center of the input image (for 640×480 image, these will be 320 and 240)
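As a rough sketch, the hand-picked values above can be turned into \(K\) and combined with a view matrix to get \(P=KV\). The helper names `calibration_matrix` and `matmul`, and the example view matrix, are assumptions made for this illustration only.

```python
def calibration_matrix(width, height, focal=400.0, skew=0.0):
    """Build K with the principal point in the center of the image."""
    return [
        [focal, skew, width / 2.0],
        [0.0, focal, height / 2.0],
        [0.0, 0.0, 1.0],
    ]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


K = calibration_matrix(640, 480)

# Hypothetical view matrix [R|T]: identity rotation, translation of 5 along z.
V = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 5.0],
]

P = matmul(K, V)  # 3x4 projection matrix
for row in P:
    print(row)
```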

### Relation with Homography

To show you how camera pose and homography are related, let’s start with writing down the equations for point projection.

\[

\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix}

=

\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix}

=

K[R|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix} = \\

=

K[R_1|R_2|R_3|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix} = \\

=

\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ z_{3D} \\ 1 \end{bmatrix}

\]

If \(z_{3D}=0\), then equations will look like this.

\[

\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix}

=

\begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix}

=

K[R|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix} = \\

=

K[R_1|R_2|R_3|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix} = \\

=

\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ 0 \\ 1 \end{bmatrix}

\]

If you carry out the matrix multiplication, you will see that you can drop the third column of the rotation matrix and the z coordinate of the 3D point and still get the same result, because \(R_3\) is only ever multiplied by zero (**reminder: you can do this only if \(z_{3D}=0\), otherwise it won’t work**). This gives you the following.

\[

\begin{bmatrix} wx_{2D} \\ wy_{2D} \\ w \end{bmatrix}

=

\begin{bmatrix} p_{11} & p_{12} & p_{14} \\ p_{21} & p_{22} & p_{24} \\ p_{31} & p_{32} & p_{34} \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ 1 \end{bmatrix}

=

K[R_1|R_2|T]\begin{bmatrix} x_{3D} \\ y_{3D} \\ 1 \end{bmatrix} = \\

=

\begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}

\begin{bmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{bmatrix}

\begin{bmatrix} x_{3D} \\ y_{3D} \\ 1 \end{bmatrix}

\]

Now note that \([R_1|R_2|T]\) is a 3×3 matrix, and that you can identify \(K[R_1|R_2|T]=H\) with the homography from the previous chapter. That’s how homography is related to the camera projection. And that’s also why you can project points on the \(z=0\) plane without worrying about the camera’s internal parameters at all: they are already contained in \(H\).
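A quick numerical check of this relation, as a sketch: for a point with \(z=0\), projecting with the full \(K[R|T]\) and with the 3×3 matrix \(K[R_1|R_2|T]\) should give the same pixel. The rotation, translation, and test point below are arbitrary values chosen for the example.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def apply(M, v):
    """Multiply M by a homogeneous vector and divide by the last component."""
    out = [sum(m * c for m, c in zip(row, v)) for row in M]
    return out[0] / out[2], out[1] / out[2]


# Arbitrary example pose: rotation of 30 degrees around the x axis.
angle = math.radians(30)
c, s = math.cos(angle), math.sin(angle)
R = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
T = [0.1, -0.2, 5.0]
K = [[400.0, 0.0, 320.0], [0.0, 400.0, 240.0], [0.0, 0.0, 1.0]]

V = [R[i] + [T[i]] for i in range(3)]                      # [R_1|R_2|R_3|T]
V_planar = [[R[i][0], R[i][1], T[i]] for i in range(3)]    # [R_1|R_2|T]

P = matmul(K, V)          # full 3x4 projection
H = matmul(K, V_planar)   # 3x3 homography for the z = 0 plane

x, y = 0.7, -0.3
full = apply(P, [x, y, 0.0, 1.0])
planar = apply(H, [x, y, 1.0])
print(full, planar)  # the two results agree up to rounding
```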

## Extending Homography

Now let’s move on to the full camera pose. The easiest way would seem to be to calculate \([R_1|R_2 |T]=K^{-1}H\), then set \(R_3=R_1\times R_2\), and obtain the full \([R|T]\) matrix.

Unfortunately, this doesn’t work. Remember that I mentioned above that matrix \(R\) is orthonormal? \(K\) and \(H\) both come out of estimations and carry errors, so it’s not guaranteed that \(R_1\) and \(R_2\) obtained in this way are orthonormal. That would make the final image look weird. Therefore, let’s make them orthonormal.

The implementation of the following text is available in the Ar class, method estimateMvMatrix. I would also like to refer you to the “Augmented Reality with Python and OpenCV” article written by Juan Gallostra. That is where I first discovered the method I am about to describe.

Let’s start by constructing \([G_1|G_2|G_3 ]=K^{-1}H\). In the implementation, you will also see that I am negating the homography matrix before plugging it into the equation. That’s because the real pinhole camera would project the flipped image, but there is no flipping here.
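As a minimal sketch of this first step, \(K^{-1}\) can be written in closed form because \(K\) is upper triangular. The names `invert_calibration` and `matmul` and the example homography are assumptions for this illustration; the negation of \(H\) mentioned above is left out to keep the example short.

```python
def invert_calibration(K):
    """Closed-form inverse of the upper triangular calibration matrix K."""
    fx, s, cx = K[0]
    fy, cy = K[1][1], K[1][2]
    return [
        [1.0 / fx, -s / (fx * fy), (s * cy - cx * fy) / (fx * fy)],
        [0.0, 1.0 / fy, -cy / fy],
        [0.0, 0.0, 1.0],
    ]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


K = [[400.0, 0.0, 320.0], [0.0, 400.0, 240.0], [0.0, 0.0, 1.0]]

# Hypothetical homography, built as K [R_1|R_2|T] with identity rotation
# and T = (0, 0, 5), so the expected result is easy to recognize.
H = [[400.0, 0.0, 1600.0], [0.0, 400.0, 1200.0], [0.0, 0.0, 5.0]]

G = matmul(invert_calibration(K), H)
G1, G2, G3 = (list(col) for col in zip(*G))  # columns of K^-1 H
print(G1, G2, G3)
```

With real, noisy inputs the columns will not come out this clean, which is exactly why the following steps are needed.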

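Because \(K\) and \(H\) are only estimates, \([G_1|G_2|G_3]\) is close to, but not exactly, a scaled version of the desired \([R_1|R_2|T]\). In other words, \(G_1\) and \(G_2\) are nearly orthogonal and nearly of equal length. The following steps first normalize the scale and then force \(R_1\) and \(R_2\) to be exactly orthonormal.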

\[

l=\sqrt{\| G_1 \| \| G_2 \|} ,\ \

G_1'=\frac{G_1}{l} ,\ \

G_2'=\frac{G_2}{l} ,\ \

G_3'=\frac{G_3}{l} \\

\vec{c}=G_1' + G_2' ,\ \

\vec{p}=G_1' \times G_2' ,\ \

\vec{d}=\vec{c} \times \vec{p} \\

R_1=\frac{1}{\sqrt{2}}\left( \frac{\vec{c}}{\| \vec{c} \| } + \frac{\vec{d}}{\| \vec{d} \| } \right) ,\ \

R_2=\frac{1}{\sqrt{2}}\left( \frac{\vec{c}}{\| \vec{c} \| } - \frac{\vec{d}}{\| \vec{d} \| } \right) \\

R_3=R_1 \times R_2 ,\ \

T=G_3'

\]

Then you can stack vectors into columns to get the final \(V=[R_1|R_2|R_3|T]\) 3×4 matrix. Finally, compute \(P=KV\) and start projecting points.
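The equations above can be sketched in plain Python as follows. This is not the code of estimateMvMatrix from the project, only an illustration of the same steps; the function name `estimate_view_matrix` and the slightly distorted input columns are made up for the example.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def scale(v, k):
    return [c * k for c in v]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def estimate_view_matrix(g1, g2, g3):
    """Turn the columns of K^-1 H into a 3x4 matrix [R_1|R_2|R_3|T]."""
    l = math.sqrt(norm(g1) * norm(g2))             # common scale factor
    g1p, g2p, t = scale(g1, 1 / l), scale(g2, 1 / l), scale(g3, 1 / l)
    c = add(g1p, g2p)
    p = cross(g1p, g2p)
    d = cross(c, p)
    cu, du = scale(c, 1 / norm(c)), scale(d, 1 / norm(d))
    r1 = scale(add(cu, du), 1 / math.sqrt(2))
    r2 = scale(sub(cu, du), 1 / math.sqrt(2))
    r3 = cross(r1, r2)
    # Stack the vectors as columns of the 3x4 view matrix.
    return [[r1[i], r2[i], r3[i], t[i]] for i in range(3)]


# Made-up, slightly non-orthogonal input with a scale of roughly 2.
V = estimate_view_matrix([2.0, 0.1, 0.0], [0.0, 1.9, 0.1], [0.0, 0.0, 10.0])
for row in V:
    print(row)
```

Since \(\vec{c}\) and \(\vec{d}\) are perpendicular by construction, \(R_1\) and \(R_2\) come out exactly orthonormal (up to rounding), no matter how noisy \(G_1\) and \(G_2\) were.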

## Summary

Now you know how to draw 3D objects into the scene. So far, all the drawing is done through simple image operations, which is useful only for basic demos. In the last chapter, you will discover how to hook the whole thing up with video and OpenGL to make more funky stuff.
