Factor Graph Optimization

From this tutorial, we will start to explore the back-end of visual SLAM. We will start from the optimization-based methods, which is usually formulated as a pose graph optimization problem.

The sparse bundle adjustment (BA) problem

In the bundle adjustment problem, we want to optimize the camera poses and the 3D points by minimizing the reprojection error, which is denoted as, \begin{equation} \begin{aligned} & \arg \min_{\mathbf{T}_{CWi}, \mathbf{X}_{Wj}} \left( \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N}\left\| \mathbf{x}_j - \mathbf{K} \mathbf{T}_{CWi} \mathbf{X}_{Wj} \right\|_2^2 \right) \\ &= \arg \min_{\mathbf{T}_{CWi}, \mathbf{X}_{Wj}} \left( \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N}\left\| \mathbf{x}_j - \pi (\mathbf{T}_{CWi}, \mathbf{X}_{Wj}) \right\|_2^2 \right) \\ &= \arg \min_{\mathbf{T}_{CWi}, \mathbf{X}_{Wj}} f = \arg \min_{\mathbf{T}_{CWi}, \mathbf{X}_{Wj}} \left( \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N} \mathbf{e}_{ij}^2 \right) \\ \label{eq:reproj_error} \end{aligned} \end{equation} The variables we want to optimize are the camera poses and the 3D points, which are denoted as: \begin{equation} \mathbf{\mathcal{X}} = [\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_i, \ldots, \boldsymbol{\xi}_M, \mathbf{X}_1, \ldots, \mathbf{X}_j, \ldots, \mathbf{X}_N]^\top \end{equation} where $\boldsymbol{\xi}_i \in \mathfrak{se}(3)$ is the Lie algebra of the camera pose $\mathbf{T}_{CWi} \in \text{SE}(3)$.
Recall the Gaussian-Newton method for solving the non-linear least squares problem, where we linearlize the objective function $f(\mathbf{x})$, \begin{equation} f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^\top \Delta\mathbf{x} \end{equation} and the optimal solution can be found by solving, \begin{equation} \Delta\mathbf{x}^* = \arg\min_{\Delta\mathbf{x}} \tfrac{1}{2}\|f(\mathbf{x}) + \mathbf{J}(\mathbf{x})^\top \Delta\mathbf{x}\|^2 \end{equation} The optimal $\mathbf{x}$ is solved with its normal equation, \begin{equation} \mathbf{J}(\mathbf{x})^\top \mathbf{J}(\mathbf{x})\Delta\mathbf{x} = -\mathbf{J}(\mathbf{x})^\top f(\mathbf{x}) \end{equation} \begin{equation} \mathbf{H}\Delta\mathbf{x} = \mathbf{g} \end{equation} where $\mathbf{H}$ is the second-order approximation of the Hessian.

Apply this to the BA problem, here we just $\mathbf{J}_C$ to denote the Jacobian over pose and $\mathbf{J}_P$ the Jacobian over 3D landmark points, \begin{equation} \begin{aligned} \frac{1}{2}\| f(\mathbf{\mathcal{X}} + \Delta\mathbf{\mathcal{X}})\|^2 &\approx \frac{1}{2}\sum_{i=1}^M \sum_{j=1}^N \|\mathbf{e}_{ij} + {\mathbf{J}_C}_{ij}\Delta\boldsymbol{\xi}_i + {\mathbf{J}_P}_{ij}\Delta\mathbf{X}_{j}\|^2 \\ &= \frac{1}{2}\|\mathbf{e} + \mathbf{J}_C\Delta \mathbf{\mathcal{X}_c} + \mathbf{J}^X\Delta\mathbf{\mathcal{X}_p}\|^2 \end{aligned} \end{equation} The incremental updates for $\mathcal{X}$ is represented as \begin{equation} \mathbf{H}\Delta\mathbf{\mathcal{X}} = -\mathbf{J}^\top \mathbf{e} \end{equation} \begin{equation} \Delta\mathbf{\mathcal{X}} = - \mathbf{H}^{-1} \mathbf{J}^\top \mathbf{e} = -( \mathbf{J}^\top \mathbf{J} )^{-1} \mathbf{J}^\top \mathbf{e} \end{equation} The Jacobian which combines $\mathbf{J}_C$ and $\mathbf{J}_P$ is denoted as, \begin{equation} \mathbf{J} = [\mathbf{J}_C, \mathbf{J}_P]^\top \end{equation} And and approximated Hessian matrix is, \begin{equation} \begin{aligned} \mathbf{H} = \mathbf{J}^\top \mathbf{J} &= \begin{bmatrix} {\mathbf{J}_C}^\top \mathbf{J}_C & {\mathbf{J}_C}^\top \mathbf{J}_P \\ {\mathbf{J}_P}^\top \mathbf{J}_C & {\mathbf{J}_P}^\top \mathbf{J}_P \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{H}_{CC} & \mathbf{H}_{CP} \\ \mathbf{H}_{PC} & \mathbf{H}_{PP} \end{bmatrix} \end{aligned} \end{equation}

Sparsity of the Hessian structure

In visual SLAM, the local observability between camera poses and 3D points makes the Jacobian of the reprojection error term $\mathbf{e}$ w.r.t $\mathbf{\mathcal{X}}$ sparse. Think about a camera moving in a scene, only a subset of the 3D point map is observed by a subset of camera pose trajectory.

Hessian sparsity introduction

We will first go through a concrete example to illustrate the sparsity of the Hessian matrix. Consider the case where we have $M=2$ cameras and $N=3$ 3D landmark points. The illustration and its factor graph is shown as, The reprojection error term is \begin{equation} \frac{1}{2} \sum_{i=1}^M \sum_{j=1}^N \|\mathbf{e}_{ij}\|^2 \end{equation} Take the Jacobian of $\mathbf{e}_{1}$ of $\mathbf{\mathcal{X}_{11}}$, hence the first pose and first 3D point, as the example, \begin{equation} \mathbf{J}_{11} = \frac{\partial \mathbf{e}_{11}}{\partial \mathcal{X}} = \Big[ \tfrac{\partial \mathbf{e}_{11}}{\partial \boldsymbol{\xi}_1}, [\mathbf{0}]_{2\times6}, \tfrac{\partial \mathbf{e}_{11}}{\partial \mathbf{X}_1}, [\mathbf{0}]_{2\times3}, [\mathbf{0}]_{2\times3} \Big] \end{equation} \begin{equation} \mathbf{J}_{12} = \frac{\partial \mathbf{e}_{12}}{\partial \mathbf{\mathcal{X}}} = \Big[ \tfrac{\partial \mathbf{e}_{12}}{\partial \boldsymbol{\xi}_1}, [\mathbf{0}]_{2\times6}, [\mathbf{0}]_{2\times3}, \tfrac{\partial \mathbf{e}_{12}}{\partial \mathbf{X}_2}, [\mathbf{0}]_{2\times3} \Big] \end{equation} Since the 3rd landmark point is not observed by the 1st camera, $\mathbf{J}_{13}$ is actually not defined and we simply skip the computation of it. And then we continue for the 2nd camera, \begin{equation} \mathbf{J}_{22} = \frac{\partial \mathbf{e}_{22}}{\partial \mathbf{\mathcal{X}}} = \Big[ [\mathbf{0}]_{2\times6}, \tfrac{\partial \mathbf{e}_{22}}{\partial \boldsymbol{\xi}_2}, [\mathbf{0}]_{2\times3}, \tfrac{\partial \mathbf{e}_{22}}{\partial \mathbf{X}_2}, [\mathbf{0}]_{2\times3} \Big] \end{equation} \begin{equation} \mathbf{J}_{23} = \frac{\partial \mathbf{e}_{23}}{\partial \mathbf{\mathcal{X}}} = \Big[ [\mathbf{0}]_{2\times6}, \tfrac{\partial \mathbf{e}_{23}}{\partial \boldsymbol{\xi}_2}, [\mathbf{0}]_{2\times3}, [\mathbf{0}]_{2\times3}, \tfrac{\partial \mathbf{e}_{23}}{\partial \mathbf{X}_3} \Big] \end{equation} stack them together and get the whole jacobian matrix, \begin{equation} \mathbf{J} = \begin{bmatrix} \mathbf{J}_{11} \\ \mathbf{J}_{12} \\ \mathbf{J}_{22} \\ \mathbf{J}_{23} \\ \end{bmatrix} \end{equation} Thus the Hessian matrix $\mathbf{H}$ can be derived as, \begin{equation} \mathbf{H} = \mathbf{J}^\top \mathbf{J} = \begin{bmatrix} \mathbf{J}_{11} ^\top, \mathbf{J}_{12} ^\top, \mathbf{J}_{22} ^\top, \mathbf{J}_{23} ^\top \\ \end{bmatrix} \begin{bmatrix} \mathbf{J}_{11} \\ \mathbf{J}_{12} \\ \mathbf{J}_{22} \\ \mathbf{J}_{23} \\ \end{bmatrix} = \sum_{i,j} \mathbf{J}_{ij}^\top \mathbf{J}_{ij} \end{equation} \begin{equation} \mathbf{J}_{11}^\top \mathbf{J}_{11} = \begin{bmatrix} \left( \frac{\partial \mathbf{e}_{11}}{\partial \boldsymbol{\xi}_1} \right)^\top \frac{\partial \mathbf{e}_{11}}{\partial \boldsymbol{\xi}_1}, [\mathbf{0}]_{6\times6}, \left( \frac{\partial \mathbf{e}_{11}}{\partial \boldsymbol{\xi}_1} \right)^\top \frac{\partial \mathbf{e}_{11}}{\partial \mathbf{X}_1}, [\mathbf{0}]_{3\times3}, [\mathbf{0}]_{3\times3} \\ \mathbf{0} \\ \left( \frac{\partial \mathbf{e}_{11}}{\partial \mathbf{X}_1} \right)^\top \frac{\partial \mathbf{e}_{11}}{\partial \boldsymbol{\xi}_1}, [\mathbf{0}]_{3\times6}, \left( \frac{\partial \mathbf{e}_{11}}{\partial \boldsymbol{\xi}_1} \right)^\top \frac{\partial \mathbf{e}_{11}}{\partial \mathbf{X}_1}, [\mathbf{0}]_{3\times3}, [\mathbf{0}]_{3\times3} \\ \mathbf{0} \\ \mathbf{0} \\ \end{bmatrix} \end{equation} \begin{equation} \mathbf{J}_{12}^\top \mathbf{J}_{12} = \begin{bmatrix} \left( \frac{\partial \mathbf{e}_{12}}{\partial \boldsymbol{\xi}_1} \right)^\top \frac{\partial \mathbf{e}_{12}}{\partial \boldsymbol{\xi}_1} , [\mathbf{0}]_{6\times6} , [\mathbf{0}]_{6\times3} , \left( \frac{\partial \mathbf{e}_{12}}{\partial \boldsymbol{\xi}_1} \right)^\top \frac{\partial \mathbf{e}_{12}}{\partial \mathbf{X}_2} , [\mathbf{0}]_{6\times3} \\ \mathbf{0} \\ \mathbf{0} \\ \left( \frac{\partial \mathbf{e}_{12}}{\partial \mathbf{X}_2} \right)^\top \frac{\partial \mathbf{e}_{12}}{\partial \boldsymbol{\xi}_1} , [\mathbf{0}]_{3\times6} , [\mathbf{0}]_{3\times3} , \left( \frac{\partial \mathbf{e}_{12}}{\partial \mathbf{X}_2} \right)^\top \frac{\partial \mathbf{e}_{12}}{\partial \mathbf{X}_2} , [\mathbf{0}]_{3\times3} \\ \mathbf{0} \end{bmatrix} \end{equation} \begin{equation} \mathbf{J}_{22}^\top \mathbf{J}_{22} = \begin{bmatrix} \mathbf{0} \\ [\mathbf{0}]_{6\times6} , \left( \frac{\partial \mathbf{e}_{22}}{\partial \boldsymbol{\xi}_2} \right)^\top \frac{\partial \mathbf{e}_{22}}{\partial \boldsymbol{\xi}_2} , [\mathbf{0}]_{6\times3} , \left( \frac{\partial \mathbf{e}_{22}}{\partial \boldsymbol{\xi}_2} \right)^\top \frac{\partial \mathbf{e}_{22}}{\partial \mathbf{X}_2} , [\mathbf{0}]_{6\times3} \\ \mathbf{0} \\ [\mathbf{0}]_{3\times6} , \left( \frac{\partial \mathbf{e}_{22}}{\partial \mathbf{X}_2} \right)^\top \frac{\partial \mathbf{e}_{22}}{\partial \boldsymbol{\xi}_2} , [\mathbf{0}]_{3\times3} , \left( \frac{\partial \mathbf{e}_{22}}{\partial \mathbf{X}_2} \right)^\top \frac{\partial \mathbf{e}_{22}}{\partial \mathbf{X}_2} , [\mathbf{0}]_{3\times3} \\ \mathbf{0} \end{bmatrix} \end{equation} \begin{equation} \mathbf{J}_{23}^\top \mathbf{J}_{23} = \begin{bmatrix} \mathbf{0} \\ [\mathbf{0}]_{6\times6} ,\left( \frac{\partial \mathbf{e}_{23}}{\partial \boldsymbol{\xi}_2} \right)^\top \frac{\partial \mathbf{e}_{23}}{\partial \boldsymbol{\xi}_2} , [\mathbf{0}]_{6\times3} , [\mathbf{0}]_{6\times3} , \left( \frac{\partial \mathbf{e}_{23}}{\partial \boldsymbol{\xi}_2} \right)^\top \frac{\partial \mathbf{e}_{23}}{\partial \mathbf{X}_3} \\ \mathbf{0} \\ \mathbf{0} \\ [\mathbf{0}]_{3\times6} , \left( \frac{\partial \mathbf{e}_{23}}{\partial \mathbf{X}_3} \right)^\top \frac{\partial \mathbf{e}_{23}}{\partial \boldsymbol{\xi}_2} , [\mathbf{0}]_{3\times3} , [\mathbf{0}]_{3\times3} , \left( \frac{\partial \mathbf{e}_{23}}{\partial \mathbf{X}_3} \right)^\top \frac{\partial \mathbf{e}_{23}}{\partial \mathbf{X}_3} \end{bmatrix} \end{equation} We define derivative terms as, \begin{equation} \mathbf{E}_{ij} \triangleq \frac{\partial \mathbf{e}_{ij}}{\partial \boldsymbol{\xi}_i} \in \mathbb{R}^{2\times 6}, \qquad \mathbf{F}_{ij} \triangleq \frac{\partial \mathbf{e}_{ij}}{\partial \mathbf{X}_j} \in \mathbb{R}^{2\times 3}. \end{equation} And the Hessian $\mathbf{H}$ is, \begin{aligned} \mathbf{H} &= \begin{bmatrix} \mathbf{H}_{CC} & \mathbf{H}_{CP} \\ \mathbf{H}_{PC} & \mathbf{H}_{PP} \end{bmatrix} \\ &= \left[ \begin{array}{cc|ccc} \mathbf{E}_{11}^\top\mathbf{E}_{11} + \mathbf{E}_{12}^\top\mathbf{E}_{12} & [\mathbf{0}]_{6\times6} & \mathbf{E}_{11}^\top\mathbf{F}_{11} & \mathbf{E}_{12}^\top\mathbf{F}_{12} & [\mathbf{0}]_{6\times3} \\[6pt] [\mathbf{0}]_{6\times6} & \mathbf{E}_{22}^\top\mathbf{E}_{22} + \mathbf{E}_{23}^\top\mathbf{E}_{23} & [\mathbf{0}]_{6\times3} & \mathbf{E}_{22}^\top\mathbf{F}_{22} & \mathbf{E}_{23}^\top\mathbf{F}_{23} \\ \hline \mathbf{F}_{11}^\top\mathbf{E}_{11} & [\mathbf{0}]_{3\times6} & \mathbf{F}_{11}^\top\mathbf{F}_{11} & [\mathbf{0}]_{3\times3} & [\mathbf{0}]_{3\times3} \\[6pt] \mathbf{F}_{12}^\top\mathbf{E}_{12} & \mathbf{F}_{22}^\top\mathbf{E}_{22} & [\mathbf{0}]_{3\times3} & \mathbf{F}_{12}^\top\mathbf{F}_{12} + \mathbf{F}_{22}^\top\mathbf{F}_{22} & [\mathbf{0}]_{3\times3} \\[6pt] [\mathbf{0}]_{3\times6} & \mathbf{F}_{23}^\top\mathbf{E}_{23} & [\mathbf{0}]_{3\times3} & [\mathbf{0}]_{3\times3} & \mathbf{F}_{23}^\top\mathbf{F}_{23} \end{array} \right] \end{aligned} It is obvious that the Hessian matrix is symmetric. The upper left block of the Hessian matrix $\mathbf{H}_{CC} \in \mathbb{R}^{6M \times 6M}$ is block diagonal and all about camera pose $\boldsymbol{\xi}$ and the lower right block $\mathbf{H}_{PP} \in \mathbb{R}^{3N \times 3N}$ is also block diagonal and all about 3D points $\mathbf{X}$. The effective size of the Hessian used for optimization depends on the observability between camera poses and 3D points. In fact, the observability information is also hidden in $\mathbf{H}$ itself. Think about the last block in the first row block of $\mathbf{H}$, it is a $[\mathbf{0}]$ block because the 3rd 3D landmark is not observable by the 1st camera. Same for the 3rd block in the second row block, it is $[\mathbf{0}]$ because the 1st 3D point is not observable by camera 2.
In summary, the Jacobian of the reprojection error over the $i$-th camera pose and the $j$-th 3D landmark points are, \begin{equation} \mathbf{J}_{ij}(\mathcal{X}) = \Big[ \underbrace{[\mathbf{0}]_{2\times6}\;\cdots\;[\mathbf{0}]_{2\times6}}_{\text{poses } 1\ldots i-1}\;\; \tfrac{\partial \mathbf{e}_{ij}}{\partial \boldsymbol{\xi}_i}\;\; \underbrace{[\mathbf{0}]_{2\times6}\;\cdots\;[\mathbf{0}]_{2\times6}}_{\text{poses } i+1\ldots M}\;\; \underbrace{[\mathbf{0}]_{2\times3}\;\cdots\;[\mathbf{0}]_{2\times3}}_{\text{landmarks } 1\ldots j-1}\;\; \tfrac{\partial \mathbf{e}_{ij}}{\partial \mathbf{X}_j}\;\; \underbrace{[\mathbf{0}]_{2\times3}\;\cdots\;[\mathbf{0}]_{2\times3}}_{\text{landmarks } j+1\ldots N} \Big] \end{equation} We use $\mathbf{\mathcal{C}}_j$ and $\mathbf{\mathcal{P}}_i$ to denote the set of observable cameras for the $j$-th 3D point and, observable 3D points for the $i$-th camera. We also use $| \mathbf{\mathcal{C}}_j |$ and $| \mathbf{\mathcal{P}}_i |$ to denote their sizes. And the Hessian matrix is, \begin{equation} \mathbf{H} = \left[ \begin{array}{cccc|cccc} \sum\limits_{j}^{| \mathbf{\mathcal{P}}_1 |} \mathbf{E}_{1j}^{\top}\mathbf{E}_{1j} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{E}_{11}^{\top}\mathbf{F}_{11} & \mathbf{E}_{12}^{\top}\mathbf{F}_{12} & \cdots & \mathbf{E}_{1N}^{\top}\mathbf{F}_{1N} \\ \mathbf{0} & \sum\limits_{j}^{| \mathbf{\mathcal{P}}_2 |} \mathbf{E}_{2j}^{\top}\mathbf{E}_{2j} & \cdots & \mathbf{0} & \mathbf{E}_{21}^{\top}\mathbf{F}_{21} & \mathbf{E}_{22}^{\top}\mathbf{F}_{22} & \cdots & \mathbf{E}_{2N}^{\top}\mathbf{F}_{2N} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \sum\limits_{j}^{| \mathbf{\mathcal{P}}_M |} \mathbf{E}_{Mj}^{\top}\mathbf{E}_{Mj} & \mathbf{E}_{M1}^{\top}\mathbf{F}_{M1} & \mathbf{E}_{M2}^{\top}\mathbf{F}_{M2} & \cdots & \mathbf{E}_{MN}^{\top}\mathbf{F}_{MN} \\ \hline \mathbf{F}_{11}^{\top}\mathbf{E}_{11} & \mathbf{F}_{21}^{\top}\mathbf{E}_{21} & \cdots &\mathbf{F}_{M1}^{\top}\mathbf{E}_{M1} & \sum\limits_{i}^{| \mathbf{\mathcal{C}}_1 |} \mathbf{F}_{i1}^{\top}\mathbf{F}_{i1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{F}_{12}^{\top}\mathbf{E}_{12} & \mathbf{F}_{22}^{\top}\mathbf{E}_{22} & \cdots & \mathbf{F}_{M2}^{\top}\mathbf{E}_{M2} & \mathbf{0} & \sum\limits_{i}^{| \mathbf{\mathcal{C}}_2 |} \mathbf{F}_{i2}^{\top}\mathbf{F}_{i2} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{F}_{1N}^{\top}\mathbf{E}_{1N} & \mathbf{F}_{2N}^{\top}\mathbf{E}_{2N} & \cdots & \mathbf{F}_{MN}^{\top}\mathbf{E}_{MN} & \mathbf{0} & \mathbf{0} & \cdots & \sum\limits_{i}^{| \mathbf{\mathcal{C}}_N |} \mathbf{F}_{iN}^{\top}\mathbf{F}_{iN} \end{array} \right] \end{equation} Notes that part of the blocks in $\mathbf{H}_{CP}$ are actually $\mathbf{0}$ matrices, here we write them all explicitly for the sake of completeness. But the real $\mathbf{H}_{CP}$ matrix (contains how many $\mathbf{0}$ blocks, e.g. it's sparsity) depends on the observability for the $i-th$ camera and the $j-th$ points. And in practice, the size of 3D points $N$ is often much larger than the size of cameras $M$.

Schur Complement

As we discussed before, the inremental updates for $\mathbf{\mathcal{X}}$ can be solved by the following linear equation, \begin{equation} \begin{bmatrix} \mathbf{H}_{CC} & \mathbf{H}_{CP} \\ \mathbf{H}_{CP}^\top & \mathbf{H}_{PP} \end{bmatrix} \begin{bmatrix} \Delta \mathbf{\mathcal{X}}_C \\ \Delta \mathbf{\mathcal{X}}_P \end{bmatrix} = \begin{bmatrix} \mathbf{J}_C \mathbf{e}_C \\ \mathbf{J}_P \mathbf{e}_P \end{bmatrix} \end{equation} We will use the internal sparsity of Hessian to make the solving more efficient. We want to avoid inversing the entire Hessian matrix by inversing only part of it. In specific, we want to inverse $\mathbf{H}$ by only inversing part of $\mathbf{H}$, ideally the diagonal part, and not $\mathbf{H}_{CP}$ and $\mathbf{H}_{CP}^\top$, which are usually more expensive to inverse than a diagonal matrix. \begin{equation} \begin{bmatrix} \mathbf{I} & -\mathbf{H}_{CP}\mathbf{H}_{PP}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{H}_{CC} & \mathbf{H}_{CP} \\ \mathbf{H}_{CP}^{\top} & \mathbf{H}_{PP} \end{bmatrix} \begin{bmatrix} \Delta \boldsymbol{\mathcal{X}}_C \\ \Delta \boldsymbol{\mathcal{X}}_P \end{bmatrix} = \begin{bmatrix} \mathbf{I} & -\mathbf{H}_{CP}\mathbf{H}_{PP}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{J}_C^{\top}\mathbf{e} \\ \mathbf{J}_P^{\top}\mathbf{e} \end{bmatrix} \end{equation} what we do next is to multiply a matrix on both sides and eliminate the $\mathbf{H}_{CP}$ in the linear system, obtaining, \begin{equation} \begin{bmatrix} \mathbf{H}_{CC} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{H}_{CP}^{\top} & \mathbf{0} \\ \mathbf{H}_{CP}^{\top} & \mathbf{H}_{PP} \end{bmatrix} \begin{bmatrix} \Delta \boldsymbol{\mathcal{X}}_C \\ \Delta \boldsymbol{\mathcal{X}}_P \end{bmatrix} = \begin{bmatrix} \mathbf{J}_C^{\top}\mathbf{e} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{J}_P^{\top}\mathbf{e} \\ \mathbf{J}_P^{\top}\mathbf{e} \end{bmatrix} \end{equation} then we get \begin{equation} \left( \mathbf{H}_{CC} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{H}_{CP}^{\top} \right) \Delta \boldsymbol{\mathcal{X}}_C = \mathbf{J}_C^{\top}\mathbf{e} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{J}_P^{\top}\mathbf{e} \end{equation} and solve $\Delta \mathbf{\mathcal{X}_C}$ as, \begin{equation} \Delta \boldsymbol{\mathcal{X}}_C = \left( \mathbf{H}_{CC} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{H}_{CP}^{\top} \right)^{-1} \left( \mathbf{J}_C^{\top}\mathbf{e} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{J}_P^{\top}\mathbf{e} \right) \end{equation} Matrix $\mathbf{S} = \left(\mathbf{H}_{CC} - \mathbf{H}_{CP}\mathbf{H}_{PP}^{-1}\mathbf{H}_{CP}^{\top} \right)$ is called the Schur complement of $\mathbf{H}_{PP}$. Although $\mathbf{S} $ is neither really sparse nor diagonal, the size is relatively small ($6M \times 6M$) and its construction and solution are typically computationally inexpensive in practice.
The incremental for the 3D points are then solved as, \begin{equation} \Delta \boldsymbol{\mathcal{X}}_P = \mathbf{H}_{PP}^{-1} \left( \mathbf{J}_P^{\top}\mathbf{e} - \mathbf{H}_{CP}^{\top} \Delta \boldsymbol{\mathcal{X}}_C \right) \end{equation} Taking a look back again, we solve $\Delta \boldsymbol{\mathcal{X}}_C$ by inversing a relative small, non-diagonal matrix, and then solve $\Delta \boldsymbol{\mathcal{X}}_P$ by inversing a large but sparse, diagonal matrix.

Pose graph optimization introduction

Sliding Window Filter and Optimization

Why Control BA Size?

Full bundle adjustment (BA) over every frame and landmark is great for offline SLAM/SfM but far too expensive for real-time SLAM. We need strategies to keep the problem small for real-time online optimization. The standard approach is: only promote selected keyframes into the BA; non-keyframes are used for pose tracking only. But keyframes accumulate over time, so we still need to bound the optimization. There're two popular options: Both are engineering trade-offs between optimality and speed. But they raise a subtle theoretical question: when we drop an old keyframe, should we just delete it, or does it still carry information that matters?

Sliding Window Formulation

Suppose our sliding window contains $M$ keyframe poses in Lie algebra form and $N$ landmarks:

\begin{equation} \boldsymbol{\xi}_k \in \mathbb{R}^6, \quad k=1,\dots,M, \qquad \mathbf{X}_j \in \mathbb{R}^3, \quad j=1,\dots,N. \end{equation}

The full state vector has dimension $6M + 3N$:

\begin{equation} \mathbf{\mathcal{X}} = \begin{bmatrix} \boldsymbol{\xi}_1 \\ \vdots \\ \boldsymbol{\xi}_M \\ \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix} \in \mathbb{R}^{6M+3N}. \end{equation}

After BA converges, the cost is locally quadratic, so the posterior is approximately Gaussian with information (Hessian) matrix:

\begin{equation} \mathbf{H} = \mathbf{J}^\top \mathbf{J} = \begin{bmatrix} \mathbf{H}_{CC} & \mathbf{H}_{CP} \\ \mathbf{H}_{PC} & \mathbf{H}_{PP} \end{bmatrix} \end{equation}

where $\mathbf{H}_{PP}$ is block-diagonal (landmarks are conditionally independent). Applying the Schur complement to eliminate landmarks:

\begin{equation} \mathbf{S} = \mathbf{H}_{CC} - \mathbf{H}_{CP}\,\mathbf{H}_{PP}^{-1}\,\mathbf{H}_{PC} \end{equation}

Why does $\mathbf{S}^{-1}$ give the covariance? Recall that a Gaussian can be written in two equivalent forms:

\begin{equation} p(\mathbf{x}) \propto \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\Big) \quad \Longleftrightarrow \quad p(\mathbf{x}) \propto \exp\!\Big(-\tfrac{1}{2}\mathbf{x}^\top \boldsymbol{\Lambda}\, \mathbf{x} + \boldsymbol{\eta}^\top \mathbf{x}\Big) \end{equation}

where $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$ is the information (precision) matrix. At the BA optimum, the cost is locally quadratic:

\begin{equation} f(\mathbf{\mathcal{X}}^* + \Delta\mathbf{\mathcal{X}}) \approx \text{const} + \tfrac{1}{2}\,\Delta\mathbf{\mathcal{X}}^\top\,\mathbf{H}\,\Delta\mathbf{\mathcal{X}} \end{equation}

Interpreting $\exp(-f)$ as a probability, this is exactly a Gaussian with $\boldsymbol{\Lambda} = \mathbf{H}$, so $\boldsymbol{\Sigma} = \mathbf{H}^{-1}$. After the Schur complement eliminates landmarks, $\mathbf{S}$ plays the role of $\mathbf{H}$ over poses only — it is the reduced information matrix, and $\mathbf{S}^{-1}$ is the pose covariance:

\begin{equation} p(\boldsymbol{\xi}_1,\dots,\boldsymbol{\xi}_M \mid \mathbf{X}_1,\dots,\mathbf{X}_N) \sim \mathcal{N}\!\bigl([\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_M]^T,\;\mathbf{S}^{-1}\bigr) \end{equation}

Intuitively, $\mathbf{S}$ encodes how tightly each pose is constrained. Two poses co-observing many landmarks will have large off-diagonal blocks in $\mathbf{S}$ — moving one forces the other to adjust.

Window update

When the window moves:

Adding a keyframe is easy; removing one forces a choice between information loss and structural damage (fill-in).

The Fill-In Problem

Say $\boldsymbol{\xi}_1$ observes landmarks $\mathbf{y}_1$ through $\mathbf{y}_4$. In the Hessian, $\boldsymbol{\xi}_1$'s row has non-zero blocks in the columns of those landmarks. When we marginalize $\boldsymbol{\xi}_1$ (apply the Schur complement to eliminate it), we condition on $\boldsymbol{\xi}_1$'s current estimate. The Schur complement fills in the landmark--landmark block: landmarks that were previously independent (diagonal block) now become coupled. This is called fill-in. Intuitively, marginalizing $\boldsymbol{\xi}_1$ turns the joint distribution into a conditional: \begin{equation} p(\boldsymbol{\xi}_1,\dots,\boldsymbol{\xi}_4,\mathbf{X}_1,\dots,\mathbf{X}_6) = p(\boldsymbol{\xi}_2,\dots,\boldsymbol{\xi}_4,\mathbf{X}_1,\dots,\mathbf{X}_6\mid\boldsymbol{\xi}_1)\; \underbrace{p(\boldsymbol{\xi}_1)}_{\text{discarded}}. \tag{9.1} \end{equation} The conditional factor $p(\cdot\mid\boldsymbol{\xi}_1)$ introduces new constraints (``where these landmarks should be, given $\boldsymbol{\xi}_1$''), destroying the sparse diagonal structure we rely on for efficient BA. If we marginalize landmarks first, fill-in appears in the pose--pose block---tolerable, since BA doesn't require a diagonal pose block. But fill-in in the landmark block kills sparsity and makes iterative solvers infeasible (exactly why early EKF backends couldn't scale).

Maintaining Sparsity: Three Cases

We preserve sparsity by handling each landmark of the departing keyframe $\boldsymbol{\xi}_1$ differently:

Intuition for marginalization in SWF

Marginalizing a keyframe means ``fix its estimate and compute how all other variables should behave, conditioned on it.'' The observed landmarks produce a prior on where they should be; if those landmarks are then also marginalized, their observers inherit a prior on where the keyframe should observe them. Information cascades forward as priors.

What Is a Pose Graph?

After a few BA iterations, landmark positions largely converge while poses still benefit from refinement. This suggests a shortcut: stop optimizing landmarks and treat them only as constraints between poses. The result is a pose graph: vertices are camera poses, edges encode relative-motion estimates (from feature matching, direct methods, GPS, IMU, etc.). Since the number of landmarks far exceeds the number of poses, this dramatically reduces the problem size and is the standard approach for large-scale SLAM and loop closure.

Residuals and Jacobians

Each node is a pose $\mathbf{T}_1,\dots,\mathbf{T}_n$. An edge between $\mathbf{T}_i$ and $\mathbf{T}_j$ carries a measured relative transform $\Delta\mathbf{T}_{ij}$. In Lie algebra: \begin{equation} \Delta\boldsymbol{\xi}_{ij} = \boldsymbol{\xi}_i^{-1}\circ\boldsymbol{\xi}_j = \ln\!\bigl(\mathbf{T}_i^{-1}\mathbf{T}_j\bigr)^{\vee} \tag{9.2} \end{equation} or equivalently in SE(3): \begin{equation} \mathbf{T}_{ij} = \mathbf{T}_i^{-1}\mathbf{T}_j. \tag{9.3} \end{equation} The residual moves the measurement to the right-hand side: \begin{equation} \mathbf{e}_{ij} = \ln\!\bigl(\mathbf{T}_{ij}^{-1}\mathbf{T}_i^{-1}\mathbf{T}_j\bigr)^{\vee}. \tag{9.4} \end{equation} To linearize, perturb both poses with left disturbances $\delta\boldsymbol{\xi}_i$, $\delta\boldsymbol{\xi}_j$: \begin{equation} \hat{e}_{ij} = \ln\!\bigl(\mathbf{T}_{ij}^{-1}\mathbf{T}_i^{-1} \exp((-\delta\boldsymbol{\xi}_i)^{\wedge})\, \exp(\delta\boldsymbol{\xi}_j^{\wedge})\,\mathbf{T}_j\bigr)^{\vee}. \tag{9.5} \end{equation} We need to shuffle the disturbance terms to one side for the BCH approximation. The key identity (adjoint property) is: \begin{equation} \exp\!\bigl((\mathrm{Ad}(\mathbf{T})\boldsymbol{\xi})^{\wedge}\bigr) = \mathbf{T}\,\exp(\boldsymbol{\xi}^{\wedge})\,\mathbf{T}^{-1}, \tag{9.6} \end{equation} which rearranges to: \begin{equation} \exp(\boldsymbol{\xi}^{\wedge})\,\mathbf{T} = \mathbf{T}\,\exp\!\Bigl(\bigl(\mathrm{Ad}(\mathbf{T}^{-1})\boldsymbol{\xi}\bigr)^{\wedge}\Bigr). \tag{9.7} \end{equation} Using this to move the disturbances rightward and applying the first-order BCH approximation: \begin{equation} \begin{aligned} \hat{e}_{ij} &= \ln\!\bigl(\mathbf{T}_{ij}^{-1}\mathbf{T}_i^{-1} \exp((-\delta\boldsymbol{\xi}_i)^{\wedge})\, \exp(\delta\boldsymbol{\xi}_j^{\wedge})\,\mathbf{T}_j\bigr)^{\vee} \\ &= \ln\!\Bigl(\mathbf{T}_{ij}^{-1}\mathbf{T}_i^{-1}\mathbf{T}_j \exp\!\bigl((-\mathrm{Ad}(\mathbf{T}_j^{-1})\delta\boldsymbol{\xi}_i)^{\wedge}\bigr)\, \exp\!\bigl((\mathrm{Ad}(\mathbf{T}_j^{-1})\delta\boldsymbol{\xi}_j)^{\wedge}\bigr) \Bigr)^{\vee} \\ &\approx \ln\!\Bigl(\mathbf{T}_{ij}^{-1}\mathbf{T}_i^{-1}\mathbf{T}_j \bigl[\mathbf{I} - (\mathrm{Ad}(\mathbf{T}_j^{-1})\delta\boldsymbol{\xi}_i)^{\wedge} + (\mathrm{Ad}(\mathbf{T}_j^{-1})\delta\boldsymbol{\xi}_j)^{\wedge} \bigr]\Bigr)^{\vee} \\ &\approx \mathbf{e}_{ij} + \frac{\partial \mathbf{e}_{ij}}{\partial\delta\boldsymbol{\xi}_i}\,\delta\boldsymbol{\xi}_i + \frac{\partial \mathbf{e}_{ij}}{\partial\delta\boldsymbol{\xi}_j}\,\delta\boldsymbol{\xi}_j \end{aligned} \tag{9.8} \end{equation} The two Jacobians are: \begin{equation} \frac{\partial \mathbf{e}_{ij}}{\partial\delta\boldsymbol{\xi}_i} = -\mathcal{J}_r^{-1}(\mathbf{e}_{ij})\,\mathrm{Ad}(\mathbf{T}_j^{-1}) \tag{9.9} \end{equation} \begin{equation} \frac{\partial \mathbf{e}_{ij}}{\partial\delta\boldsymbol{\xi}_j} = \mathcal{J}_r^{-1}(\mathbf{e}_{ij})\,\mathrm{Ad}(\mathbf{T}_j^{-1}). \tag{9.10} \end{equation} The right Jacobian $\mathcal{J}_r$ on $\mathfrak{se}(3)$ is complex. When the error is small: \begin{equation} \mathcal{J}_r^{-1}(\mathbf{e}_{ij}) \approx \mathbf{I} + \frac{1}{2} \begin{bmatrix} \boldsymbol{\phi}_e^{\wedge} & \boldsymbol{\rho}_e^{\wedge} \\ \mathbf{0} & \boldsymbol{\phi}_e^{\wedge} \end{bmatrix}. \tag{9.11} \end{equation} In practice the residuals may not be small after optimization (inconsistent edges, outliers), so setting $\mathcal{J}_r\approx\mathbf{I}$ introduces some error---but it often works well enough.

Overall Objective

With the Jacobians in hand, pose graph optimization is standard least squares. Let $\mathcal{E}$ be the set of all edges: \begin{equation} \min\;\frac{1}{2}\sum_{i,j\in\mathcal{E}} \mathbf{e}_{ij}^{T}\,\boldsymbol{\Sigma}_{ij}^{-1}\,\mathbf{e}_{ij}. \tag{9.12} \end{equation} Solve with Gauss--Newton, Levenberg--Marquardt, etc.

References

  1. Visual SLAM: From Theory to Practice, by Xiang Gao, Tao Zhang, Qinrui Yan and Yi Liu

If you find this article helpful, please consider citing it in your research:
@article{Hu2026learnvslam,
  author = {Hu, Yafei},
  title = {Learning Visual SLAM: Pose Graph Optimization},
  year = {2026}
}