3D Rigid Body Motion
Rigid body transformation
In robotics and computer vision, we often need to choose multiple coordinate frames to reflect the motion of cameras,
different links reference to a common selected world frame, aka inertial frame. The transformation between these frames is denoted as a transformation matrix $\mathbf{T}$,
$$
\begin{align}
\mathbf{T} =
\begin{bmatrix}
\mathbf{R} & \mathbf{t} \\
\mathbf{0} & 1
\end{bmatrix}
\end{align}
$$
where $\mathbf{R}$ is a rotation matrix, $\mathbf{t}$ is a translation vector. The transformation describe a relative motion, so we have to specify the motion is in relative to which frame.
For example, we define the pose of the camera frame $C$ expressed in the world frame $W$ as $\mathbf{T}_{WC}$, which means frame C seen from frame W.
Assume a robot moves with a onboard camera. Robot Body frame is $B$, the pose of the camera frame $C$ expressed in the robot body frame $B$ is $\mathbf{T}_{BC}$.
The robot pose can be expressed as,
$$
\mathbf{T}_{WB} = \mathbf{T}_{WC} \mathbf{T}_{BC}^{-1}
$$
The pose indicating transformation from body frame $B$ seen from world frame $W$, which indicating the motion of the robot body, is a transform of a 3D landmark point $\mathbf{X}$ from the body frame $B$ to the world frame $W$.
$$
\begin{align}
\mathbf{X}_W = \mathbf{T}_{WB} \mathbf{X}_B
\end{align}
$$
And the transformation from world frame $W$ to body frame $B$ is the inverse of the above transformation,
$$
\begin{align}
\mathbf{X}_B = \mathbf{T}_{WB}^{-1} \mathbf{X}_W
\end{align}
$$
We will give a illustration of the inverse of the pose in the next section.
Rotation matrix
Let's start with a 2D example, 2-dimensional rotation matrix is denoted as:
$$
\begin{align}
\mathbf{R} =
\begin{bmatrix}
\cos(\theta) & -\sin(\theta) \\
\sin(\theta) & \cos(\theta) \\
\end{bmatrix}
\end{align}
$$
rotation $\mathbf{R}$ rotates vectors in the plane by $\theta$ degrees counterclockwise.
Extend to 3-dimensional rotation matrix:
$$
\begin{align}
\mathbf{R} = \mathbf{R}_x \mathbf{R}_y \mathbf{R}_z =
\begin{bmatrix}
1 & 0 & 0 \\
0 & \cos(\theta) & -\sin(\theta) \\
0 & \sin(\theta) & \cos(\theta) \\
\end{bmatrix}
\begin{bmatrix}
\cos(\theta) & 0 & \sin(\theta) \\
0 & 1 & 0 \\
-\sin(\theta) & 0 & \cos(\theta) \\
\end{bmatrix}
\begin{bmatrix}
\cos(\theta) & -\sin(\theta) & 0 \\
\sin(\theta) & \cos(\theta) & 0 \\
0 & 0 & 1 \\
\end{bmatrix}
\end{align}
$$
$\mathbf{R}$ is orthogonal matrix and its determinant equals to 1.
$$
\begin{align}
\mathbf{R}\mathbf{R}^{\top} = \mathbf{I} \quad \text{hence} \quad \mathbf{R}^{-1} = \mathbf{R}^{\top}
\end{align}
$$
$$
\begin{align}
\det(\mathbf{R}) = 1
\end{align}
$$
Try to prove the determinant of $\mathbf{R}$ is 1?
$$
\begin{align}
\det(\mathbf{R}) = \det(\mathbf{R}_x \mathbf{R}_y \mathbf{R}_z) = \det(\mathbf{R}_x) \det(\mathbf{R}_y) \det(\mathbf{R}_z) = 1 \times 1 \times 1 = 1
\end{align}
$$
If the determinant of a matrix is 1, it means the transformation preserves the volume of the linear space — it does not scale it.
These properties make rotation matrix very unique, it is not just a matrix, but actually a manifold, more specifically, a also a Lie Group and also a Lie Manifold. We will talk about more details about Lie Group and manifold in the Matrix Lie Group section.
Problem of Euler angle and Introduction to Quaternion
So can we just use a 3D vector which contains the rotation angles around x, y, z axes, to represent the rotation? The answer is no since it will bring the problem of Gimbal Lock, shown in the following figure.
Although it is OK to represent the 3D rotation by 3 by 3 rotation matrix, it is not the most compact way to do so since we are literally using 9D vector to represent a 3D rotation. To solve this problem, we use a 4D vector to represent the rotation, called Quaternion, which is
$$
\begin{align}
\mathbf{q} =
\begin{bmatrix}
q_w \\
q_x \\
q_y \\
q_z
\end{bmatrix}
\end{align}
$$
where $q_x, q_y, q_z$ are the imaginary parts, or called vector part, and $q_w$ is the real part, or called scalar part.
Quaternion Mathmatics
If we use complex number to represent roation, it would be easiter to understand quaternion. Recall euler formula,
$$
\begin{align}
e^{i\theta} = \cos(\theta) + i\sin(\theta)
\end{align}
$$
Homogeneous coordinate
if a point $\mathbf{P}$, its coordinate is transformed by $\mathbf{R}_1$ to $\mathbf{P}_1$, then by $\mathbf{R}_2$ to $\mathbf{P}_2$, without translation, we have,
$$
\begin{align}
\mathbf{P}_2 = \mathbf{R}_2 \mathbf{P}_1 = \mathbf{R}_2 \mathbf{R}_1 \mathbf{P}
\end{align}
$$
if a point $\mathbf{P}$, its coordinate is transformed by by $\mathbf{t}_1$ to $\mathbf{P}_1$, then by $\mathbf{t}_2$ to $\mathbf{P}_2$, without rotation, we have,
$$
\begin{align}
\mathbf{P}_2 = \mathbf{t}_2 + \mathbf{P}_1 = \mathbf{t}_2 + \mathbf{t}_1 + \mathbf{P}
\end{align}
$$
Suppose a point $\mathbf{P}$, its coordinate is transformed by [$\mathbf{R}_1$, $\mathbf{t}_1$] to $\mathbf{P}_1$, Then transformed by [$\mathbf{R}_2$, $\mathbf{t}_2$] to $\mathbf{P}_2$.
$$
\begin{align}
\mathbf{P}_1 = \mathbf{R}_1 \mathbf{P} + \mathbf{t}_1
\end{align}
$$
$$
\begin{align}
\mathbf{P}_2 = \mathbf{R}_2 \mathbf{P}_1 + \mathbf{t}_2
\end{align}
$$
$$
\begin{align}
\mathbf{P}_2 = \mathbf{R}_2 (\mathbf{R}_1 \mathbf{P} + \mathbf{t}_1) + \mathbf{t}_2
= \mathbf{R}_2 \mathbf{R}_1 \mathbf{P} + \mathbf{R}_2 \mathbf{t}_1 + \mathbf{t}_2
\neq \mathbf{R}_2 \mathbf{R}_1 \mathbf{P} + \mathbf{t}_1 + \mathbf{t}_2
\end{align}
$$
Multiple transformations cannot be described in a uniform way. How do we solve this problem?
The answer is to represent $\mathbf{P}$ in homogeneous coordinates.
$$
\begin{align}
\begin{bmatrix}
\mathbf{P}_1 \\
1
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{R}_1 & \mathbf{t}_1 \\
\mathbf{0}^{\top} & 1
\end{bmatrix}
\begin{bmatrix}
\mathbf{P} \\
1
\end{bmatrix}
\end{align}
$$
Thus, it can represent the transformations from $\mathbf{P}$ to $\mathbf{P}_1$, then to $\mathbf{P}_2$ in a uniform way.
$$
\begin{align}
\begin{bmatrix}
\mathbf{P}_2 \\
1
\end{bmatrix}
=
\mathbf{T}_2 \mathbf{T}_1
\begin{bmatrix}
\mathbf{P} \\
1
\end{bmatrix}
\end{align}
$$
Original coordinate of 3D point $\mathbf{P}$ and $\mathbf{P}$ in homogeneous coordinates,
$$
\begin{align}
\mathbf{P} =
\begin{bmatrix}
x \\
y \\
z
\end{bmatrix}
\quad
\mathbf{\tilde P} =
\begin{bmatrix}
x \\
y \\
z \\
1
\end{bmatrix}
\end{align}
$$
Homogeneous coordinate is defined only up to scale, all of them represents the same 3D coordinate
$$
\begin{align}
s
\begin{bmatrix}
x \\
y \\
z \\
w
\end{bmatrix}
=
\begin{bmatrix}
x \\
y \\
z \\
w
\end{bmatrix}
=
\begin{bmatrix}
x / w \\
y / w \\
z / w \\
1
\end{bmatrix}
\end{align}
$$
where $s$ is a scalar.
Illustration of the inverse of the pose
Let's take a look at this example, the body frame equals to apply a 90 degree counterclockwise rotation around the z-axis. The pose of the body here is simply
$$
\begin{align}
\mathbf{T}_{WB} =
\begin{bmatrix}
0 & -1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\end{align}
$$
Given a point $\mathbf{X}_W$ in the world frame,
$$
\begin{align}
\mathbf{X}_W =
\begin{bmatrix}
1 \\
0 \\
0 \\
1
\end{bmatrix}
\end{align}
$$
The point in body frame is
$$
\begin{align}
\mathbf{X}_B = \begin{bmatrix}
0 \\
-1 \\
0 \\
1
\end{bmatrix}
= \mathbf{T}_{WB}^{-1} \mathbf{X}_W
=
\begin{bmatrix}
0 & 1 & 0 & 0 \\
-1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 \\
0 \\
0 \\
1
\end{bmatrix}
\end{align}
$$
Conversion between Coordinate Conventions
The coordinate convention usually consists of left-hand and right-hand system. To tell if a coordinate system is left-hand or right-hand, we just thumbs up for both hands. The direction of the thumb is the direction of the z-axis, and the direction of the fingers is the direction of the x-axis to y-axis. Right hand system is more common, but some softwares like Unity and UnrealEngine do uses left hand system.
The confusing part usually not just from the left and right hand system, but also from the difination of X, Y, Z axis. For example, in ROS, X axis is forward, Y is left and Z is up. Different softwares also uses different coordinate conventions. OpenCV, COLMAP, Open3D usually follow the same convention, ROS, OpenGL also use a different convention. The following figures the illustration of the different conventions.
The problem then is how to convert the coordinate convention between different softwares. The answer is to use the a rotation matrix. This is also called change of basis in linear algebra. Note that only when the coordinate conventions are all using either left-hand or right-hand system, we can use a rotation matrix to convert the coordinate convention.
We first start with converting a reference 3D landmark point $\mathbf{X}$ (in homogeneous coordinates), from the OpenCV convention to ROS convention. The same point will have different coordinates in different conventions, obviously.
$$
\begin{equation}
\begin{aligned}
\mathbf{X}_{\text{ROS}} & = \mathbf{T}_{\text{OpenCV_to_ROS}} \mathbf{X}_{\text{OpenCV}} \\
& =
\begin{bmatrix}
0 & 0 & 1 & 0 \\
-1 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} \mathbf{X}_{\text{OpenCV}}
\end{aligned}
\end{equation}
$$
So what are we doing here? It's actually quite intuitive, the resultant X-axis in ROS equals to the Z-axis in OpenCV, the Y-axis in ROS is the flip of the X-axis in OpenCV, and the Z-axis in ROS is the flip of the Y-axis in OpenCV. The camera pose in world frame is,
The tricky part comes from converting a transformation, aka a pose, from one convention to another. We will have to do a little derivation here. In Visual odometry or geometric vision,
assume we compute the relative pose $\mathbf{T}_{C_0C}$ between the current frame $C$ and the first frame $C_0$, and also reconstruct a 3D landmark point $\mathbf{X}$ in the first frame $C_0$.
$$
\begin{equation}
\begin{aligned}
\mathbf{T}_{WC} &= \mathbf{T}_{WC_0} \mathbf{T}_{C_0C} \mathbf{T}_{CC} \\
& = \mathbf{T}_{\text{OpenCV_to_ROS}} \mathbf{T}_{C_0C} \mathbf{I}
\end{aligned}
\end{equation}
$$
And in general, the conversion between two conventions is,
$$
\begin{align}
\mathbf{T}_{WC}^{\text{tgt}} = \mathbf{T}_{\text{world_conversion}} \mathbf{T}_{WC}^{\text{src}} \mathbf{T}_{\text{camera_conversion}}^{-1}
\end{align}
$$
This equation can easily be proved, assume we have a pose $\mathbf{T}_{AB}$, indicating the pose of frame B in frame A. We want to get the pose of frame D in frame C. And we know the conversion A $\rightarrow$ C, meaning transform a point from A to C, denoted as $\mathbf{T}_{CA}$, B $\rightarrow$ D, meaning transform of a point from B to D, denoted as $\mathbf{T}_{DB}$. Here B and D are camera frames, and A and C are world frames.
$$
\begin{equation}
\begin{aligned}
\mathbf{T}_{CD} &= \mathbf{T}_{CA} \mathbf{T}_{AB} \mathbf{T}_{BD} \\
&= \mathbf{T}_{CA} \mathbf{T}_{AB} \mathbf{T}_{DB}^{-1}
\end{aligned}
\end{equation}
$$
References
- Visual SLAM: From Theory to Practice, by Xiang Gao, Tao Zhang, Qinrui Yan and Yi Liu