Overview

During the summer camp, inspired by the concept of engineering robots in the RoboMaster Super League, each team was tasked with building a robot to complete the missions of collecting and exchanging minerals. My responsibility was designing the vision algorithm to meet the robot’s visual localization requirements.

I designed and implemented a real-time target pose estimation system (6 degrees of freedom) that runs on an embedded platform (Raspberry Pi 5). This system could calculate the 3D coordinates of the target (a yellow calibration board above the exchange slot) and transmit them to the downstream MCU via a serial port.

Initially, I adopted a purely traditional OpenCV-based approach, which combined mask filtering with the centroid method. However, during practical testing, I discovered that this algorithm was highly sensitive to environmental factors. Additionally, performing a full-frame search on high-resolution images incurred significant computational overhead.

Therefore, I decided to incorporate deep learning into the solution. I trained a YOLO model to identify the calibration board region. This “Coarse Localization + Precise Detection” design offered three key advantages:

Improved Robustness: YOLO excels at handling complex backgrounds and partial occlusions. By first identifying the region of interest (ROI), it effectively filtered out background interference.
Accuracy Assurance: While deep learning models may not achieve sub-pixel precision in directly regressing corner coordinates, traditional geometric algorithms can. By retaining OpenCV’s sub-pixel corner detection, I ensured the geometric accuracy of the final pose estimation.
Optimized Computational Performance: Once the ROI was obtained using YOLO, OpenCV only needed to search for corner points within a very small image region, significantly reducing CPU computation.

Below is the overall Pipeline Diagram of the system:

Modules

Config

The Config module serves as the centralized parameter management layer for the entire system. Instead of hardcoding algorithm parameters, all tunable settings are externalized into a YAML configuration file ( assets/config.yml ). The system accepts three categories of parameters:

Camera & Board: Device ID, capture resolution, path to camera calibration file, and the physical dimensions of the marker board.

TIP: If you switch cameras, simply update the config and calibration file without recompiling.

Marker Detection: HSV colour bounds for isolating the yellow board region, grayscale inversion threshold for locating dark corner markers, morphological kernel sizes for noise removal, and the minimum contour area to filter out spurious detections — These parameters are highly sensitive to lighting and background conditions in the deployment environment.
YOLO Detector: Model path, input resolution, confidence threshold, NMS threshold, and the extra padding around the detected ROI before feeding it to PoseTracker.

The Config class wraps OpenCV’s FileStorage API to parse YAML in a standard, cross-platform manner:

bool Config::loadConfig(const std::string &config_file_path)
{
    cv::FileStorage fs(config_file_path, cv::FileStorage::READ);
    // Parse camera settings
    fs["camera_id"] >> camera_id;
    fs["camera_width"] >> width;
    // ... parse detection parameters

    // Parse YOLO parameters with graceful degradation
    const cv::FileNode yolo_node = fs["yolo"];
    if (!yolo_node.empty()) {
        yolo_node["model_path"] >> yolo_params.model_path;
        // ... parse YOLO-specific settings
    } else {
        // Provide safe defaults if YOLO section is missing
        yolo_params.model_path = "";  // disables YOLO
    }
    return true;
}

The implementation follows a graceful degradation pattern: if the YOLO section is missing from the config file, the system logs a warning and assigns safe defaults, allowing the application to run without crashing.

YoloDetector

Model Training

1. Dataset Construction

Capture Environment: Images were recorded from the robot’s perspective, spanning multiple lighting conditions (indoor lighting, natural daylight, competition venue lighting), viewing angles (0° to 45° inclination), and distances (0.5 m to 3 m from the board).
Data Augmentation: To expand the dataset without manual recapture, I applied:
- Geometric Transforms: rotation (±15°) / scaling (0.8–1.2×) / horizontal flip
- Photometric Transforms: brightness adjustment (±20%) / contrast (±15%) / saturation (±10%)
- Dropout Augmentation: random erasing of image regions to simulate occlusion
This synthetic augmentation typically triples the effective dataset size.
Annotation Tool: I used labelImg, a popular open-source tool, to annotate the images. Each image’s annotation is stored in a .txt file with normalized bounding box coordinates: <class_id> <x_center> <y_center> <width> <height>. Since we have only one class, class_id is always 0.
I partitioned the dataset as follows:
- Training set: 70% (~400 images after augmentation)
- Validation set: 20% (~100 images)
- Test set: 10% (~50 images, held out until final evaluation)

2. Finding the Best Model

To identify the optimal model for this task, I experimented with six variants: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv11n, YOLOv11s, and YOLOv11m. Each model was trained under identical conditions, and their performance was evaluated using the following metrics:

P (Precision): Measures the proportion of true positives among all positive predictions.
R (Recall): Measures the proportion of true positives among all actual positives.
PR (Precision-Recall Curve): Visualizes the trade-off between precision and recall at different thresholds.
F1 Score: The harmonic mean of precision and recall, providing a single metric for overall performance.

After analyzing the results, YOLOv11s achieved the best balance across all metrics:

I trained the model using the following configuration:

from ultralytics import YOLO

if __name__ == "__main__":
    model = YOLO(r"yolo11s.pt")
    model.train(
        data=r"yellow_board.yaml",  # Dataset configuration
        epochs=50,                   # Number of training epochs
        imgsz=640,                   # Input image size
        batch=-1,                    # Auto-adjust batch size
        cache="ram",                 # Cache dataset in RAM for faster training
        workers=1,                   # Number of data loader workers
        project="result",            # Output directory
        name="yolo11s",              # Experiment name
    )

The training process was visualized in real-time, and the final results demonstrated significant improvements in both precision and recall:

3. Exporting to ONNX Format

YOLOv11 models are trained in PyTorch format (.pt), but the C++ inference runtime expects ONNX (Open Neural Network Exchange) format for cross-platform portability. Export is straightforward:

model = YOLO('best.pt')
model.export(format='onnx', imgsz=640, opset=12, simplify=True)

The exported ONNX model expects input shape [1, 3, 640, 640] (batch size 1, RGB channels, 640×640 pixels) and outputs [1, 4 + num_classes, num_anchors] — exactly what the YoloDetector’s postprocess() expects.

Inference Pipeline

The YoloDetector follows a standard three-step inference pattern:

Input Frame (1280×720, BGR)
        │
        ▼ preprocess()
        │  · Compute uniform scale: min(640/1280, 640/720) ≈ 0.889
        │  · Resize to 1138×640, centre on 640×640 grey canvas (114,114,114)
        │  · Compute offset: (51, 0) for letterbox padding
        │  · Normalize [0, 255] → [0, 1]
        │  · BGR → RGB via swapRB flag
        │  · blobFromImage: HWC to CHW, dtype float32
        │
        ▼ blob [1, 3, 640, 640] -- (1 batch, 3 channels, H×W)
        │
        ▼ net_.forward() -- (Runs YOLOv11s inference on GPU/CPU)
        │
        ▼ output [1, 4 + num_classes, num_anchors] -- (Raw tensor from ONNX model)
        │
        ▼ postprocess()
        │  ① Reshape [1, 4+nc, anchors] → [4+nc, anchors]
        │  ② Transpose → [anchors, 4+nc] -- (each row = one anchor)
        │  ③ For each anchor:
        │     · Find argmax class: max(row[4:4+nc])
        │     · Filter: if max_score < conf_thresh (0.50) → skip
        │     · Reverse letterbox: cx_frame = (cx_blob - offset.x) / scale
        │     · Convert centre → top-left: x = cx - w/2, y = cy - h/2
        │     · Clamp to frame bounds: [0, 1280]×[0, 720]
        │  ④ cv::dnn::NMSBoxes() removes overlaps (IoU > 0.40)
        │  ⑤ std::sort() by descending confidence
        │
        ▼ vector<Detection> -- (sorted by confidence)
        │  Each Detection: {box (x,y,w,h), confidence, classId}
        │
        ▼ getBestBox()
        │  Returns detections.front().box -- (highest confidence)
        │  or empty cv::Rect() if no detection survives

1. Preprocessing: Letterboxing for Precision

cv::Mat preprocess(const cv::Mat &frame, float &scale, cv::Point2f &offset) const

YOLO models trained with letterbox preprocessing must be inferred with the same preprocessing, otherwise coordinate mapping breaks down. The scale and offset computed are explicitly returned as output parameters and passed to postprocess() for coordinate transformation.

2. Postprocessing: Decoding the YOLO Tensor

(a) Reshape and Transpose

cv::Mat data = output.reshape(1, output.size[1]);
cv::transpose(data, data);

Each row is one anchor prediction: [1, 4 + nc, anchors] → [4 + nc, anchors] → [anchors, 4 + nc]
First 4 channels: box centre (cx, cy) and size (w, h) in the 640×640 blob coordinate system.

(b) Find the Argmax Class

For each anchor, identify the class with the highest confidence:

float max_score = 0.f;
int best_cls = 0;
for (int c = 0; c < num_classes; ++c) {
    if (row[4 + c] > max_score) {
        max_score = row[4 + c];
        best_cls = c;
    }
}

This allows the system to handle multi-class scenarios (in this project, we only care about the “yellow board” class).

(c) Coordinate Transformation

// Remove letterbox offset
cx_frame = (cx_blob - offset.x) / scale;
cy_frame = (cy_blob - offset.y) / scale;
bw_frame = bw_blob / scale;
bh_frame = bh_blob / scale;

// Convert from centre format to top-left format
x = cx_frame - bw_frame / 2;
y = cy_frame - bh_frame / 2;

This inverse transformation is the sine qua non of correct multi-scale detection — get it wrong by even a few pixels, and the downstream PoseTracker will search in the wrong ROI.

(d) Boundary Clamping

x = clamp(x, 0, frame.width - 1);
y = clamp(y, 0, frame.height - 1);
w = clamp(w, 1, frame.width - x);
h = clamp(h, 1, frame.height - y);

This prevents out-of-bounds array access when the model predicts boxes that partially lie outside the frame.

(e) Non-Maximum Suppression (NMS)

cv::dnn::NMSBoxes(boxes, confidences, conf_thresh_, nms_thresh_, indices);

conf_thresh_: filters out detections below a confidence cutoff (default 0.50)
nms_thresh_: suppresses overlapping boxes with IoU > 40% (default 0.40)

(f) Sorting by Confidence

The results are sorted in descending confidence order, so the highest-confidence detection can be trivially extracted via getBestBox(), which simply returns detections.front().box.

3. Integration with PoseTracker

After YOLO detection, the best bounding box is passed to PoseTracker::processRoi(). The padding parameter in config.yml is crucial here — if the YOLO box is too tight, it may crop the corner markers at the edges. The default 20-pixel expansion provides a safety margin.

PoseTracker

Corner Detection

1. Yellow Region Extraction

cv::cvtColor(frame, imgHSV, cv::COLOR_BGR2HSV);
inRange(imgHSV, config.detector_params.hsv_yellow_lower,
        config.detector_params.hsv_yellow_upper, yellowMask);
cv::Mat kernel_large = getStructuringElement(cv::MORPH_RECT,
        cv::Size(config.detector_params.yellow_close_kernel_size, ...));
morphologyEx(yellowMask, yellowMask, cv::MORPH_CLOSE, kernel_large);

First, convert the frame to HSV colour space. HSV is far superior to RGB for colour-based segmentation because it separates colour information (H, S) from intensity (V), making it robust to lighting variations.

The inRange() call isolates pixels within the yellow HSV bounds.

However, due to shadows, compression artifacts, or uneven lighting, the yellow region may contain small dark holes. Morphological closing (a dilation followed by an erosion) bridges these gaps, filling small holes while preserving the outer boundary of the yellow region.

2. Dark Marker Extraction

cv::cvtColor(frame, imgGray, cv::COLOR_BGR2GRAY);
threshold(imgGray, blackMask, config.detector_params.grayscale_thresh_inv,
          255, cv::THRESH_BINARY_INV);

Separately, convert the frame to grayscale and apply inverse thresholding. The threshold value isolates pixels darker than this threshold. Inverse thresholding (THRESH_BINARY_INV) inverts the result, so dark regions become white (255).

3. Intersection and Denoising

bitwise_and(yellowMask, blackMask, imgMask);
cv::Mat kernel_small = getStructuringElement(cv::MORPH_RECT,
        cv::Size(config.detector_params.marker_open_kernel_size, ...));
morphologyEx(imgMask, imgMask, cv::MORPH_OPEN, kernel_small);

The intersection of these two masks (bitwise_and) yields the “black-on-yellow” markers.

Morphological opening (an erosion followed by a dilation) then removes small noise blobs that survived the prior filters. The kernel size is small to avoid eroding the actual marker corners.

4. Contour-Based Corner Localization

For each connected component (contour) in the binary mask:

Area filtering: Discard contours with area < min_marker_area to reject noise.
Centroid calculation: Compute the geometric centroid of the contour using moments.
Representative corner point: For each contour, find the contour point closest to the centroid. This provides a sub-pixel estimate that’s more stable than simply taking the contour’s bounding-box corners.

Finally, accept the detection only if exactly four corners are found.

5. Corner Reordering for Consistency

The reorderCorners() method imposes a canonical ordering on the four detected corners:

Top-left: Minimizes x + y (sum is smallest in upper-left)
Top-right: Maximizes x − y (difference is largest when x » y)
Bottom-right: Maximizes x + y (sum is largest in lower-right)
Bottom-left: Minimizes x − y (difference is smallest when y » x)

This ordering ensures consistent correspondence between the detected 2D corners and the pre-computed 3D model.

Pose Estimation

Once four accurate 2D-3D correspondences are established, estimatePose() invokes OpenCV’s PnP solver:

void PoseTracker::estimatePose()
{
    cv::solvePnP(boardCorners3D, boardCorners2D, cameraMatrix, distCoefficients,
                 r_vec, t_vec);
}

Inputs

boardCorners3D: Four known 3D corner positions in board space (in metres, Z = 0).
boardCorners2D: The four detected 2D corners in image space (in pixels).
cameraMatrix: 3×3 intrinsic camera matrix from calibration.
distCoefficients: Lens distortion coefficients from calibration.

Outputs:

r_vec: 3-element rotation vector (Rodrigues representation), encoding the 3×3 rotation matrix.
t_vec: 3-element translation vector (in metres), representing the position of the board’s origin relative to the camera.

What is PnP ? Perspective-n-Point is a classical problem in computer vision: given n known 3D points and their projections onto the image plane, recover the camera’s pose (rotation and translation) relative to the 3D coordinate system.

ROI Processing

The processRoi() method extends the full-frame logic to handle YOLO’s bounding-box output:

bool PoseTracker::processRoi(const cv::Mat &frame, const cv::Rect &roi)
{
    // Validate ROI; fall back to full-frame processing if invalid.
    cv::Rect safe_roi = roi & cv::Rect(0, 0, frame.cols, frame.rows);
    if (safe_roi.empty())
        return processFrame(frame);

    // Run corner detection on the cropped sub-image.
    cv::Mat cropped = frame(safe_roi);
    findCorners(cropped);

    if (!boardCorners2D.empty())
    {
        // Map detected corners from ROI-local coords back to full-frame coords.
        for (auto &pt : boardCorners2D)
        {
            pt.x += static_cast<float>(safe_roi.x);
            pt.y += static_cast<float>(safe_roi.y);
        }
        estimatePose();
        isPoseFound = true;
    } else {
        isPoseFound = false;
    }
    return isPoseFound;
}

This design enables the two-stage pipeline: YOLO reduces the search space, PoseTracker performs precise corner detection within that reduced space, and the coordinate mapping ensures correctness.

Visualizer

Rendering Flow

draw()
  ├─→ drawStatus()          [Always: "Tracking" or "Searching..."]
  └─→ if (isTargetFound)
       ├─→ drawDetection()  [Quadrilateral + corner circles + centroid]
       └─→ drawPose()       [3D axes via projectPoints() + numerical text]

3D-to-2D Projection

The technical part of visualization involves converting 3D board-space coordinates to 2D image coordinates:

cv::projectPoints(axisPoints3D,     // 3D axis endpoints (0.02m along each axis)
                  r_vec,            // Rotation from PnP
                  t_vec,            // Translation from PnP
                  cameraMatrix,     // Camera intrinsics
                  distCoefficients, // Lens distortion
                  axisPoints2D);    // Output: 2D projected points

The Visualizer renders multiple layers of information, each serving a specific diagnostic or informational purpose:

Element	Color	Type	Location	Purpose
Status Text	White	Text	Top-left (20, 30)	“Tracking” or “Searching…” feedback
X-axis Arrow	Red	Arrow (3px)	From origin	Left-right motion indicator
Y-axis Arrow	Green	Arrow (3px)	From origin	Up-down motion indicator
Z-axis Arrow	Blue	Arrow (3px)	From origin	Forward-backward indicator
Total Distance	Yellow	Text	Top-left (20, 150)	Euclidean distance to the board
YOLO Bounding Box	Purple	Rectangle (2px)	Around ROI	Coarse detection region (from main.cpp)
Corner Markers	Red	Circles (r=8px)	At each corner	Highlights precise corner localization
Corner Quadrilateral	Cyan	Lines (2px)	On board	Connects four corners & shows board orientation
Centroid Marker	Red	Circle (r=8px)	Board centre	Reference point for board position

Raspberry Pi 5

SerialPort

The SerialPort module encapsulates serial communication logic, enabling the pose estimation results to be transmitted to a downstream embedded controller (e.g. robot’s motor control MCU).

Protocol Design

The pose transmission format is deliberately minimalist (3-DoF translation):

S,<x>,<y>,<z>,E\n

If full 6-DoF transmission becomes necessary, the protocol can be extended without refactoring:

The PoseTracker module calculates the rotation vector (r_vec) via cv::solvePnP()
The getRotationVector() interface provides direct access to this data
Converting to Euler angles or quaternions is straightforward

// Optional 6-DoF extension:
S,<x>,<y>,<z>,<roll>,<pitch>,<yaw>,E\n
// or with quaternions:
S,<x>,<y>,<z>,<qw>,<qx>,<qy>,<qz>,E\n

The default serial port and baud rate are ~~hardcoded~~ (To be optimized: set in Config module) in main.cpp:

SerialPort communicator("/dev/ttyAMA4", 115200);

However, this is easily customized for different hardware:

// USB-to-serial adapter (common on PC)
SerialPort communicator("/dev/ttyUSB0", 115200);

// Lower baud rate for long cables or noisy environments
SerialPort communicator("/dev/ttyAMA4", 9600);

// Windows COM port
SerialPort communicator("COM3", 115200);

Implementation

The SerialPort class wraps Boost.Asio’s serial port functionality:

SerialPort::SerialPort(const std::string &port_name, unsigned int baud_rate)
    : m_io_service(), m_port(m_io_service)
{
    try {
        m_port.open(port_name);
        // Configure 8N1: 8 data bits, no parity, 1 stop bit
        m_port.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
        m_port.set_option(boost::asio::serial_port_base::character_size(8));
        m_port.set_option(boost::asio::serial_port_base::parity(boost::asio::serial_port_base::parity::none));
        m_port.set_option(boost::asio::serial_port_base::stop_bits(boost::asio::serial_port_base::stop_bits::one));
        m_port.set_option(boost::asio::serial_port_base::flow_control(boost::asio::serial_port_base::flow_control::none));
        std::cout << "Serial port opened at " << baud_rate << " baud." << std::endl;
    } catch (const boost::system::system_error &e) {
        std::cerr << "Error opening serial port: " << e.what() << std::endl;
    }
}

Transmission (To be optimized)

The sendPose() method formats and transmits a pose packet:

bool SerialPort::sendPose(double x, double y, double z)
{
    if (!m_port.is_open())
        return false;

    std::stringstream ss;
    ss << "S," << x << "," << y << "," << z << ",E\n";
    std::string data = ss.str();

    try {
        boost::asio::write(m_port, boost::asio::buffer(data));
        return true;
    } catch (const boost::system::system_error &e) {
        std::cerr << "Error sending data: " << e.what() << std::endl;
        return false;
    }
}

Synchronous write: boost::asio::write() blocks until all bytes are sent.
String formatting: std::stringstream provides automatic float-to-ASCII conversion with default precision.
Buffer management: boost::asio::buffer() wraps the string’s internal buffer; no additional memory allocation occurs.

Future Extensions

Asynchronous writes ! Use async_write() to avoid blocking the main vision loop
Bidirectional protocol: MCU sends pose confirmation; Pi retransmits on timeout
Confidence fields: Add corner detection score → MCU adjusts control gains

Deployment

1. Serial Port Settings

dtoverlay -a | grep uart
dtoverlay -h uart4

sudo vim /boot/firmware/config.txt
# Add "dtoverlay=uart4" at the end
sudo reboot

ls /dev/ttyAMA*
# "/dev/ttyAMA4" should appear

2. Hardware Connection

GPIO14 = TXD0 -> ttyAMA0
GPIO0  = TXD2 -> ttyAMA1
GPIO4  = TXD3 -> ttyAMA2
GPIO8  = TXD4 -> ttyAMA3
GPIO12 = TXD5 -> ttyAMA4 [√]

GPIO15 = RXD0 -> ttyAMA0
GPIO1  = RXD2 -> ttyAMA1
GPIO5  = RXD3 -> ttyAMA2
GPIO9  = RXD4 -> ttyAMA3
GPIO13 = RXD5 -> ttyAMA4 [√]

3. Debugging

sudo apt-get install minicom
sudo minicom -D /dev/ttyAMA4 -b 115200

Real-Time 6-DoF Pose Estimation

2025/07/27