Floor plane estimation

Estimating a floor plane from an averaged point cloud captured with an OAK-D Pro camera, with the RGB frame shown only as room context.

2026-04-18

Frikk Fossdal suggested that I write about some of the projects I work on, which makes this post a good excuse to test a blogging tool I am building for generating animations and diagrams with coding agents.

One of those projects is a stationary robot that observes the room around it with an OAK-D Pro camera, using an averaged point cloud from its depth sensor to answer a basic geometric question: where is the floor?

A reliable floor estimate gives the robot a local reference frame and lets us check whether the people-pose estimates from the camera's AI pipeline make physical sense.

RGB reference frame and the point cloud built by averaging 50 depth frames at the same pose. Colour encodes depth: blue is close, teal is far. Orbit with your mouse.

Averaging 50 depth frames reduces long-range sensor noise. The result is still messy, but the dominant surfaces are coherent enough to start fitting planes.
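The averaging step can be sketched as follows. This is a minimal illustration rather than the robot's actual code: the function name `average_depth_frames` is hypothetical, and it assumes that the sensor reports missing depth as 0, so those readings are left out of the mean. For independent per-frame noise, averaging N frames lowers the standard deviation by a factor of √N, which is about 7 for 50 frames.

```python
import numpy as np

def average_depth_frames(frames, min_valid=1):
    """Per-pixel mean over a stack of depth frames, skipping invalid readings.

    frames: array-like of shape (n_frames, height, width).
    Returns a float array (height, width); pixels with fewer than
    min_valid valid readings become NaN.
    """
    stack = np.asarray(frames, dtype=np.float64)
    valid = stack > 0                       # assumption: 0 marks a missing reading
    counts = valid.sum(axis=0)
    sums = np.where(valid, stack, 0.0).sum(axis=0)
    mean = np.full(counts.shape, np.nan)
    # Divide only where enough valid readings exist (never by zero).
    np.divide(sums, counts, out=mean, where=counts >= max(min_valid, 1))
    return mean
```

Pixels that stay NaN can simply be dropped before the depth image is back-projected into a point cloud.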

Sampling candidate planes

We cannot test every possible plane because the space of orientations and offsets is continuous. Instead we use RANSAC (random sample consensus): draw three random points, fit the plane through them, and count the points that lie close to it as inliers.

A point is an inlier when its perpendicular distance to the plane is below a depth-adaptive threshold: roughly 2% of the view depth, clamped between 3 cm and 12 cm. We draw 192 such triplets and keep the 12 best candidates.

Each iteration draws three seed points (white circles), fits a plane through them, and highlights the inliers in the existing point cloud. After several iterations the best candidate is shown.
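The sampling loop described above might look roughly like this. It is a sketch under stated assumptions, not the project's implementation: `ransac_planes` and `adaptive_threshold` are hypothetical names, "view depth" is taken to be the camera-frame z coordinate in metres, and candidates are ranked by inlier count alone, whereas the real ranking also uses the fit and coverage measures discussed below.

```python
import numpy as np

def adaptive_threshold(depth, ratio=0.02, lo=0.03, hi=0.12):
    """Inlier threshold in metres: ratio * depth, clamped to [lo, hi]."""
    return np.clip(ratio * np.asarray(depth, dtype=float), lo, hi)

def ransac_planes(points, n_samples=192, n_keep=12, seed=0):
    """Return up to n_keep (inlier_count, normal, offset) tuples, best first.

    A plane is the set of p with normal . p + offset = 0, normal of unit length.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    thresh = adaptive_threshold(points[:, 2])   # assumption: view depth = camera z
    candidates = []
    for _ in range(n_samples):
        a, b, c = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(b - a, c - a)
        length = np.linalg.norm(normal)
        if length < 1e-9:                       # collinear seeds define no plane
            continue
        normal = normal / length
        offset = -float(normal @ a)
        dist = np.abs(points @ normal + offset)  # perpendicular distances
        candidates.append((int(np.count_nonzero(dist < thresh)), normal, offset))
    # Simplification: rank by inlier count only.
    candidates.sort(key=lambda cand: cand[0], reverse=True)
    return candidates[:n_keep]
```

Fixing the seed makes a run reproducible, which helps when comparing candidate rankings between frames.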

The ranking combines two ideas: how tightly inliers cluster around the plane, and how evenly they cover it. First, the residuals.

Measuring fit

For each inlier, we compute the signed perpendicular distance to the candidate plane. The root-mean-square of those distances is the RMS error, the main fit measure.

The plane sweeps automatically. Each line is the perpendicular distance from a point to the plane; the bar chart below shows the same distances per point. You can also drag the handles to explore.

As the plane above sweeps through angles and offsets, the error bars and RMS readout respond. When the plane crosses the dense floor cluster, the bars shrink; when it tilts away, they grow.
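In code, the residuals and the RMS error are only a few lines. The sketch below assumes a plane written as normal · p + offset = 0; the function names are illustrative. Dividing by the length of the normal keeps the distances in metres even if the normal is not unit length.

```python
import numpy as np

def signed_distances(points, normal, offset):
    """Signed perpendicular distance of each point to the plane n . p + d = 0."""
    normal = np.asarray(normal, dtype=float)
    points = np.asarray(points, dtype=float)
    return (points @ normal + offset) / np.linalg.norm(normal)

def rms_error(points, normal, offset):
    """Root-mean-square of the signed distances."""
    r = signed_distances(points, normal, offset)
    return float(np.sqrt(np.mean(r ** 2)))
```

Because the distances are squared, points above and below the plane contribute equally. The sign only matters when inspecting individual residuals.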

Refining with Gauss-Newton

The best RANSAC candidate is only a starting point. A Gauss-Newton solver refines it with up to 96 damped least-squares steps. Each step updates the two angles that parameterise the plane normal and the perpendicular offset.

The update at each step solves

$(\mathbf{J}^\top \mathbf{J} + \lambda\,\mathbf{I})\,\boldsymbol{\delta} = -\mathbf{J}^\top \mathbf{r}$

where $\mathbf{J}$ is the Jacobian of the residuals with respect to the three plane parameters, $\mathbf{r}$ is the vector of signed distances, and $\lambda = 5 \times 10^{-3}$ is a Levenberg–Marquardt damping term. Convergence is declared when $\lVert\boldsymbol{\delta}\rVert$ drops below $10^{-5}$.

Starting from the RANSAC candidate (dashed) the solver takes discrete steps toward the optimal plane. The RMS readout and error bars update at each step.

For a clean floor cluster, convergence usually takes fewer than 20 steps. The damping keeps updates stable when the Jacobian is nearly singular, such as when the candidate plane is nearly parallel to the point cloud's principal axis.
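A compact version of this refinement loop is sketched below. The post does not say which two angles are used, so the spherical parameterisation in `normal_from_angles` is an assumption, as are the function names. The damping value, step limit, and stopping tolerance follow the numbers given above.

```python
import numpy as np

def normal_from_angles(theta, phi):
    """Unit normal from polar angle theta and azimuth phi (assumed parameterisation)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def refine_plane(points, theta, phi, offset, lam=5e-3, max_steps=96, tol=1e-5):
    """Damped Gauss-Newton on (theta, phi, offset); returns (params, steps_taken)."""
    points = np.asarray(points, dtype=float)
    params = np.array([theta, phi, offset], dtype=float)
    step = 0
    for step in range(1, max_steps + 1):
        theta, phi, offset = params
        r = points @ normal_from_angles(theta, phi) + offset   # signed distances
        # Derivatives of the normal with respect to each angle.
        dn_dtheta = np.array([np.cos(theta) * np.cos(phi),
                              np.cos(theta) * np.sin(phi),
                              -np.sin(theta)])
        dn_dphi = np.array([-np.sin(theta) * np.sin(phi),
                            np.sin(theta) * np.cos(phi),
                            0.0])
        J = np.column_stack([points @ dn_dtheta,
                             points @ dn_dphi,
                             np.ones(len(points))])
        # Solve (J^T J + lam I) delta = -J^T r.
        delta = np.linalg.solve(J.T @ J + lam * np.eye(3), -J.T @ r)
        params += delta
        if np.linalg.norm(delta) < tol:
            break
    return params, step
```

With this parameterisation the azimuth becomes poorly determined when theta is close to 0 or π, because the normal barely changes with phi there. The damping term keeps the linear system solvable in that case.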

What kind of surface?

Low RMS is not enough. A wall or ceiling can also be flat. A floor should be large and evenly covered with points. Clutter may fit tightly but have a small footprint; a partial wall may be long and thin.

We measure this by projecting inliers onto the candidate plane as a top-down view, then overlaying a 24 × 24 grid. Two numbers come out:

fill_ratio: the fraction of cells inside the inlier bounding rectangle that contain at least one point. A large floor fills most of its footprint; a sparse surface leaves gaps.

density_cv: the coefficient of variation (standard deviation divided by mean) of per-cell point counts. Uniform surfaces score low; clustered surfaces score high.

The combined score is fill_ratio / (1 + density_cv). Higher is better.

Top-down projection of two candidate surfaces overlaid with the 24 × 24 scoring grid. The floor (left) scores well on both metrics; the cluttered surface (right) does not.

This rejects many walls even when their RMS is low, because a wall usually covers only a thin strip of the grid.
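The two coverage numbers and the combined score can be computed with a 2D histogram. The sketch below makes two assumptions that the text leaves open: the bounding rectangle is axis-aligned in an arbitrary in-plane basis, and density_cv is taken over every cell of that rectangle, including empty ones. `plane_basis` and `coverage_score` are illustrative names.

```python
import numpy as np

def plane_basis(normal):
    """Two orthonormal vectors spanning the plane with the given normal."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Pick a helper axis that is not parallel to the normal.
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper)
    u = u / np.linalg.norm(u)
    return u, np.cross(n, u)

def coverage_score(inliers, normal, grid=24):
    """Return (fill_ratio, density_cv, score) for inliers projected onto the plane."""
    inliers = np.asarray(inliers, dtype=float)
    u, v = plane_basis(normal)
    # histogram2d spans the min..max of each coordinate: the bounding rectangle.
    counts, _, _ = np.histogram2d(inliers @ u, inliers @ v, bins=grid)
    fill_ratio = np.count_nonzero(counts) / counts.size
    density_cv = counts.std() / counts.mean()
    return fill_ratio, density_cv, fill_ratio / (1.0 + density_cv)
```

A perfectly uniform surface approaches a score of 1, since fill_ratio tends to 1 and density_cv to 0. Any empty or overfull cells pull the score down.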

Learning from feedback

Not every frame has an unobstructed view of the floor. Furniture, people, and oblique angles can all produce plausible candidates that are not actually the floor.

During teaching, a human can confirm or reject proposed surfaces. A rejection excludes that inlier cluster from the next search, while a confirmation saves a surface profile.

That human-in-the-loop step is only needed for teaching. Once the system has a useful set of confirmed profiles, experience mode can compare new candidates against them on its own and promote surfaces that look similar.

The profile stores the mean and variance of absolute perpendicular distances, the fill_ratio, density per square metre, density_cv, and the footprint aspect ratio. These values become the features used for later comparison.

Variance is useful because it describes how tightly points cluster around a surface. A confirmed floor tends to have a narrow residual distribution. A rough or mixed surface spreads its residuals out.

Two candidate surfaces with individual residuals and fitted normal distributions. The low-variance floor profile is much tighter than the high-variance candidate.

On later runs, experience mode compares new candidates with saved profiles. A surface that matches earlier floor confirmations rises to the top faster without requiring a person to re-teach the system on every frame.
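The exact comparison used in experience mode is not described here, so the following is only one possible sketch. The `SurfaceProfile` fields mirror the features listed above, while `profile_distance` (a mean relative difference across features) and `closest_profile` are hypothetical stand-ins for whatever similarity measure the real system uses.

```python
from dataclasses import dataclass, astuple
import numpy as np

@dataclass
class SurfaceProfile:
    residual_mean: float     # mean of |perpendicular distance|, metres
    residual_var: float      # variance of |perpendicular distance|
    fill_ratio: float
    density_per_m2: float
    density_cv: float
    aspect_ratio: float      # footprint long side / short side

def residual_stats(signed_distances):
    """Mean and variance of the absolute residuals."""
    a = np.abs(np.asarray(signed_distances, dtype=float))
    return float(a.mean()), float(a.var())

def profile_distance(candidate, reference, eps=1e-9):
    """Hypothetical measure: mean relative feature difference (0 = identical)."""
    a = np.array(astuple(candidate), dtype=float)
    b = np.array(astuple(reference), dtype=float)
    return float(np.mean(np.abs(a - b) / (np.abs(a) + np.abs(b) + eps)))

def closest_profile(candidate, saved_profiles):
    """Saved profile most similar to the candidate under profile_distance."""
    return min(saved_profiles, key=lambda p: profile_distance(candidate, p))
```

Relative differences are used so that features on very different scales, such as variance in square metres and density per square metre, contribute comparably. A real system might weight the features or learn those weights instead.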

Putting it together

The pipeline averages 50 depth frames, samples 192 RANSAC triplets, keeps the 12 best candidates, refines with Gauss-Newton, scores by RMS and spatial coverage, and saves confirmed surface profiles for later runs.

The result is a stable, pose-relative floor frame the robot can use as a foundation for everything else.