Real-Time 6-DoF Pose Estimation

2025/07/27

Overview

During the summer camp, inspired by the concept of engineering robots in the RoboMaster Super League, each team was tasked with building a robot to complete the missions of collecting and exchanging minerals. My responsibility was designing the vision algorithm to meet the robot’s visual localization requirements.

I designed and implemented a real-time target pose estimation system (6 degrees of freedom) that runs on an embedded platform (Raspberry Pi 5). This system could calculate the 3D coordinates of the target (a yellow calibration board above the exchange slot) and transmit them to the downstream MCU via a serial port.

The Machine for exchanging minerals

Initially, I adopted a purely traditional OpenCV-based approach, which combined mask filtering with the centroid method. However, during practical testing, I discovered that this algorithm was highly sensitive to environmental factors. Additionally, performing a full-frame search on high-resolution images incurred significant computational overhead.

Therefore, I decided to incorporate deep learning into the solution. I trained a YOLO model to identify the calibration board region. This “Coarse Localization + Precise Detection” design offered three key advantages:

  1. Improved Robustness: YOLO excels at handling complex backgrounds and partial occlusions. By first identifying the region of interest (ROI), it effectively filtered out background interference.

  2. Accuracy Assurance: While deep learning models may not achieve sub-pixel precision in directly regressing corner coordinates, traditional geometric algorithms can. By retaining OpenCV’s sub-pixel corner detection, I ensured the geometric accuracy of the final pose estimation.

  3. Optimized Computational Performance: Once the ROI was obtained using YOLO, OpenCV only needed to search for corner points within a very small image region, significantly reducing CPU computation.

Below is the overall Pipeline Diagram of the system:

System Pipeline Diagram

Modules

Config

The Config module serves as the centralized parameter management layer for the entire system. Instead of hardcoding algorithm parameters, all tunable settings are externalized into a YAML configuration file ( assets/config.yml ). The system accepts three categories of parameters:

TIP: If you switch cameras, simply update the config and calibration file without recompiling.

The Config class wraps OpenCV’s FileStorage API to parse YAML in a standard, cross-platform manner:

bool Config::loadConfig(const std::string &config_file_path)
{
    cv::FileStorage fs(config_file_path, cv::FileStorage::READ);
    // Parse camera settings
    fs["camera_id"] >> camera_id;
    fs["camera_width"] >> width;
    // ... parse detection parameters

    // Parse YOLO parameters with graceful degradation
    const cv::FileNode yolo_node = fs["yolo"];
    if (!yolo_node.empty()) {
        yolo_node["model_path"] >> yolo_params.model_path;
        // ... parse YOLO-specific settings
    } else {
        // Provide safe defaults if YOLO section is missing
        yolo_params.model_path = "";  // disables YOLO
    }
    return true;
}

The implementation follows a graceful degradation pattern: if the YOLO section is missing from the config file, the system logs a warning and assigns safe defaults, allowing the application to run without crashing.


YoloDetector

Model Training

1. Dataset Construction

2. Finding the Best Model

To identify the optimal model for this task, I experimented with six variants: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv11n, YOLOv11s, and YOLOv11m. Each model was trained under identical conditions, and their performance was evaluated using the following metrics:

After analyzing the results, YOLOv11s achieved the best balance across all metrics:

YOLO11s Training Result

I trained the model using the following configuration:

from ultralytics import YOLO

if __name__ == "__main__":
    model = YOLO(r"yolo11s.pt")
    model.train(
        data=r"yellow_board.yaml",  # Dataset configuration
        epochs=50,                   # Number of training epochs
        imgsz=640,                   # Input image size
        batch=-1,                    # Auto-adjust batch size
        cache="ram",                 # Cache dataset in RAM for faster training
        workers=1,                   # Number of data loader workers
        project="result",            # Output directory
        name="yolo11s",              # Experiment name
    )

The training process was visualized in real-time, and the final results demonstrated significant improvements in both precision and recall:

YOLO Training Process (video)

3. Exporting to ONNX Format

YOLOv11 models are trained in PyTorch format (.pt), but the C++ inference runtime expects ONNX (Open Neural Network Exchange) format for cross-platform portability. Export is straightforward:

model = YOLO('best.pt')
model.export(format='onnx', imgsz=640, opset=12, simplify=True)

The exported ONNX model expects input shape [1, 3, 640, 640] (batch size 1, RGB channels, 640×640 pixels) and outputs [1, 4 + num_classes, num_anchors] — exactly what the YoloDetector’s postprocess() expects.

Inference Pipeline

The YoloDetector follows a standard three-step inference pattern:

Input Frame (1280×720, BGR)
        │
        ▼ preprocess()
        │  · Compute uniform scale: min(640/1280, 640/720) ≈ 0.889
        │  · Resize to 1138×640, centre on 640×640 grey canvas (114,114,114)
        │  · Compute offset: (51, 0) for letterbox padding
        │  · Normalize [0, 255] → [0, 1]
        │  · BGR → RGB via swapRB flag
        │  · blobFromImage: HWC to CHW, dtype float32
        │
        ▼ blob [1, 3, 640, 640] -- (1 batch, 3 channels, H×W)
        │
        ▼ net_.forward() -- (Runs YOLOv11s inference on GPU/CPU)
        │
        ▼ output [1, 4 + num_classes, num_anchors] -- (Raw tensor from ONNX model)
        │
        ▼ postprocess()
        │  ① Reshape [1, 4+nc, anchors] → [4+nc, anchors]
        │  ② Transpose → [anchors, 4+nc] -- (each row = one anchor)
        │  ③ For each anchor:
        │     · Find argmax class: max(row[4:4+nc])
        │     · Filter: if max_score < conf_thresh (0.50) → skip
        │     · Reverse letterbox: cx_frame = (cx_blob - offset.x) / scale
        │     · Convert centre → top-left: x = cx - w/2, y = cy - h/2
        │     · Clamp to frame bounds: [0, 1280]×[0, 720]
        │  ④ cv::dnn::NMSBoxes() removes overlaps (IoU > 0.40)
        │  ⑤ std::sort() by descending confidence
        │
        ▼ vector<Detection> -- (sorted by confidence)
        │  Each Detection: {box (x,y,w,h), confidence, classId}
        │
        ▼ getBestBox()
        │  Returns detections.front().box -- (highest confidence)
        │  or empty cv::Rect() if no detection survives

1. Preprocessing: Letterboxing for Precision

cv::Mat preprocess(const cv::Mat &frame, float &scale, cv::Point2f &offset) const

YOLO models trained with letterbox preprocessing must be inferred with the same preprocessing, otherwise coordinate mapping breaks down. The scale and offset computed are explicitly returned as output parameters and passed to postprocess() for coordinate transformation.

2. Postprocessing: Decoding the YOLO Tensor

(a) Reshape and Transpose

cv::Mat data = output.reshape(1, output.size[1]);
cv::transpose(data, data);

(b) Find the Argmax Class

For each anchor, identify the class with the highest confidence:

float max_score = 0.f;
int best_cls = 0;
for (int c = 0; c < num_classes; ++c) {
    if (row[4 + c] > max_score) {
        max_score = row[4 + c];
        best_cls = c;
    }
}

This allows the system to handle multi-class scenarios (in this project, we only care about the “yellow board” class).

(c) Coordinate Transformation

// Remove letterbox offset
cx_frame = (cx_blob - offset.x) / scale;
cy_frame = (cy_blob - offset.y) / scale;
bw_frame = bw_blob / scale;
bh_frame = bh_blob / scale;

// Convert from centre format to top-left format
x = cx_frame - bw_frame / 2;
y = cy_frame - bh_frame / 2;

This inverse transformation is the sine qua non of correct multi-scale detection — get it wrong by even a few pixels, and the downstream PoseTracker will search in the wrong ROI.

(d) Boundary Clamping

x = clamp(x, 0, frame.width - 1);
y = clamp(y, 0, frame.height - 1);
w = clamp(w, 1, frame.width - x);
h = clamp(h, 1, frame.height - y);

This prevents out-of-bounds array access when the model predicts boxes that partially lie outside the frame.

(e) Non-Maximum Suppression (NMS)

cv::dnn::NMSBoxes(boxes, confidences, conf_thresh_, nms_thresh_, indices);

(f) Sorting by Confidence

The results are sorted in descending confidence order, so the highest-confidence detection can be trivially extracted via getBestBox(), which simply returns detections.front().box.

3. Integration with PoseTracker

After YOLO detection, the best bounding box is passed to PoseTracker::processRoi(). The padding parameter in config.yml is crucial here — if the YOLO box is too tight, it may crop the corner markers at the edges. The default 20-pixel expansion provides a safety margin.


PoseTracker

Corner Detection

Corner Detection Pipeline

1. Yellow Region Extraction

cv::cvtColor(frame, imgHSV, cv::COLOR_BGR2HSV);
inRange(imgHSV, config.detector_params.hsv_yellow_lower,
        config.detector_params.hsv_yellow_upper, yellowMask);
cv::Mat kernel_large = getStructuringElement(cv::MORPH_RECT,
        cv::Size(config.detector_params.yellow_close_kernel_size, ...));
morphologyEx(yellowMask, yellowMask, cv::MORPH_CLOSE, kernel_large);

First, convert the frame to HSV colour space. HSV is far superior to RGB for colour-based segmentation because it separates colour information (H, S) from intensity (V), making it robust to lighting variations.

The inRange() call isolates pixels within the yellow HSV bounds.

However, due to shadows, compression artifacts, or uneven lighting, the yellow region may contain small dark holes. Morphological closing (a dilation followed by an erosion) bridges these gaps, filling small holes while preserving the outer boundary of the yellow region.

2. Dark Marker Extraction

cv::cvtColor(frame, imgGray, cv::COLOR_BGR2GRAY);
threshold(imgGray, blackMask, config.detector_params.grayscale_thresh_inv,
          255, cv::THRESH_BINARY_INV);

Separately, convert the frame to grayscale and apply inverse thresholding. The threshold value isolates pixels darker than this threshold. Inverse thresholding (THRESH_BINARY_INV) inverts the result, so dark regions become white (255).

3. Intersection and Denoising

bitwise_and(yellowMask, blackMask, imgMask);
cv::Mat kernel_small = getStructuringElement(cv::MORPH_RECT,
        cv::Size(config.detector_params.marker_open_kernel_size, ...));
morphologyEx(imgMask, imgMask, cv::MORPH_OPEN, kernel_small);

The intersection of these two masks (bitwise_and) yields the “black-on-yellow” markers.

Morphological opening (an erosion followed by a dilation) then removes small noise blobs that survived the prior filters. The kernel size is small to avoid eroding the actual marker corners.

4. Contour-Based Corner Localization

For each connected component (contour) in the binary mask:

Finally, accept the detection only if exactly four corners are found.

5. Corner Reordering for Consistency

The reorderCorners() method imposes a canonical ordering on the four detected corners:

This ordering ensures consistent correspondence between the detected 2D corners and the pre-computed 3D model.

Pose Estimation

Once four accurate 2D-3D correspondences are established, estimatePose() invokes OpenCV’s PnP solver:

void PoseTracker::estimatePose()
{
    cv::solvePnP(boardCorners3D, boardCorners2D, cameraMatrix, distCoefficients,
                 r_vec, t_vec);
}

Inputs

Outputs:

What is PnP ? Perspective-n-Point is a classical problem in computer vision: given n known 3D points and their projections onto the image plane, recover the camera’s pose (rotation and translation) relative to the 3D coordinate system.

ROI Processing

The processRoi() method extends the full-frame logic to handle YOLO’s bounding-box output:

bool PoseTracker::processRoi(const cv::Mat &frame, const cv::Rect &roi)
{
    // Validate ROI; fall back to full-frame processing if invalid.
    cv::Rect safe_roi = roi & cv::Rect(0, 0, frame.cols, frame.rows);
    if (safe_roi.empty())
        return processFrame(frame);

    // Run corner detection on the cropped sub-image.
    cv::Mat cropped = frame(safe_roi);
    findCorners(cropped);

    if (!boardCorners2D.empty())
    {
        // Map detected corners from ROI-local coords back to full-frame coords.
        for (auto &pt : boardCorners2D)
        {
            pt.x += static_cast<float>(safe_roi.x);
            pt.y += static_cast<float>(safe_roi.y);
        }
        estimatePose();
        isPoseFound = true;
    } else {
        isPoseFound = false;
    }
    return isPoseFound;
}

This design enables the two-stage pipeline: YOLO reduces the search space, PoseTracker performs precise corner detection within that reduced space, and the coordinate mapping ensures correctness.


Visualizer

Rendering Flow

draw()
  ├─→ drawStatus()          [Always: "Tracking" or "Searching..."]
  └─→ if (isTargetFound)
       ├─→ drawDetection()  [Quadrilateral + corner circles + centroid]
       └─→ drawPose()       [3D axes via projectPoints() + numerical text]

3D-to-2D Projection

The technical part of visualization involves converting 3D board-space coordinates to 2D image coordinates:

cv::projectPoints(axisPoints3D,     // 3D axis endpoints (0.02m along each axis)
                  r_vec,            // Rotation from PnP
                  t_vec,            // Translation from PnP
                  cameraMatrix,     // Camera intrinsics
                  distCoefficients, // Lens distortion
                  axisPoints2D);    // Output: 2D projected points

The Visualizer renders multiple layers of information, each serving a specific diagnostic or informational purpose:

ElementColorTypeLocationPurpose
Status TextWhiteTextTop-left (20, 30)“Tracking” or “Searching…” feedback
X-axis ArrowRedArrow (3px)From originLeft-right motion indicator
Y-axis ArrowGreenArrow (3px)From originUp-down motion indicator
Z-axis ArrowBlueArrow (3px)From originForward-backward indicator
Total DistanceYellowTextTop-left (20, 150)Euclidean distance to the board
YOLO Bounding BoxPurpleRectangle (2px)Around ROICoarse detection region (from main.cpp)
Corner MarkersRedCircles (r=8px)At each cornerHighlights precise corner localization
Corner QuadrilateralCyanLines (2px)On boardConnects four corners & shows board orientation
Centroid MarkerRedCircle (r=8px)Board centreReference point for board position

Real-time Visual Demo

Raspberry Pi 5

SerialPort

The SerialPort module encapsulates serial communication logic, enabling the pose estimation results to be transmitted to a downstream embedded controller (e.g. robot’s motor control MCU).

Protocol Design

The pose transmission format is deliberately minimalist (3-DoF translation):

S,<x>,<y>,<z>,E\n

If full 6-DoF transmission becomes necessary, the protocol can be extended without refactoring:

// Optional 6-DoF extension:
S,<x>,<y>,<z>,<roll>,<pitch>,<yaw>,E\n
// or with quaternions:
S,<x>,<y>,<z>,<qw>,<qx>,<qy>,<qz>,E\n

The default serial port and baud rate are hardcoded (To be optimized: set in Config module) in main.cpp:

SerialPort communicator("/dev/ttyAMA4", 115200);

However, this is easily customized for different hardware:

// USB-to-serial adapter (common on PC)
SerialPort communicator("/dev/ttyUSB0", 115200);

// Lower baud rate for long cables or noisy environments
SerialPort communicator("/dev/ttyAMA4", 9600);

// Windows COM port
SerialPort communicator("COM3", 115200);

Implementation

The SerialPort class wraps Boost.Asio’s serial port functionality:

SerialPort::SerialPort(const std::string &port_name, unsigned int baud_rate)
    : m_io_service(), m_port(m_io_service)
{
    try {
        m_port.open(port_name);
        // Configure 8N1: 8 data bits, no parity, 1 stop bit
        m_port.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
        m_port.set_option(boost::asio::serial_port_base::character_size(8));
        m_port.set_option(boost::asio::serial_port_base::parity(boost::asio::serial_port_base::parity::none));
        m_port.set_option(boost::asio::serial_port_base::stop_bits(boost::asio::serial_port_base::stop_bits::one));
        m_port.set_option(boost::asio::serial_port_base::flow_control(boost::asio::serial_port_base::flow_control::none));
        std::cout << "Serial port opened at " << baud_rate << " baud." << std::endl;
    } catch (const boost::system::system_error &e) {
        std::cerr << "Error opening serial port: " << e.what() << std::endl;
    }
}

Transmission (To be optimized)

The sendPose() method formats and transmits a pose packet:

bool SerialPort::sendPose(double x, double y, double z)
{
    if (!m_port.is_open())
        return false;

    std::stringstream ss;
    ss << "S," << x << "," << y << "," << z << ",E\n";
    std::string data = ss.str();

    try {
        boost::asio::write(m_port, boost::asio::buffer(data));
        return true;
    } catch (const boost::system::system_error &e) {
        std::cerr << "Error sending data: " << e.what() << std::endl;
        return false;
    }
}

Future Extensions


Deployment

Raspberry Pi Configuration

1. Serial Port Settings

dtoverlay -a | grep uart
dtoverlay -h uart4

sudo vim /boot/firmware/config.txt
# Add "dtoverlay=uart4" at the end
sudo reboot

ls /dev/ttyAMA*
# "/dev/ttyAMA4" should appear

2. Hardware Connection

GPIO14 = TXD0 -> ttyAMA0
GPIO0  = TXD2 -> ttyAMA1
GPIO4  = TXD3 -> ttyAMA2
GPIO8  = TXD4 -> ttyAMA3
GPIO12 = TXD5 -> ttyAMA4 [√]

GPIO15 = RXD0 -> ttyAMA0
GPIO1  = RXD2 -> ttyAMA1
GPIO5  = RXD3 -> ttyAMA2
GPIO9  = RXD4 -> ttyAMA3
GPIO13 = RXD5 -> ttyAMA4 [√]

3. Debugging

sudo apt-get install minicom
sudo minicom -D /dev/ttyAMA4 -b 115200
Serial Port Debugging