# IFrontend

The `IFrontend` interface provides a unified module for joint estimation of depth and optical flow from stereo image pairs, with optional uncertainty estimation for both tasks.

## Why an additional layer of abstraction?
Sometimes depth estimation and matching are tightly coupled, so we need a way to combine them. For instance, if depth (via disparity) and matching use the same network with the same weights, then instead of running inference twice sequentially, we can compose a batch of size 2 and run inference once.
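To make this concrete, here is a minimal sketch of that batching trick, assuming a shared matching network `match_net` and `StereoData`-like frames exposing `imageL`/`imageR` (the function and the splitting logic are illustrative, not the actual MAC-VO code):

```python
import torch

# Disparity is a left-right match within frame t2; optical flow is a
# temporal match between the left images of frames t1 and t2. With one
# shared matching network, both pairs fit into a single batch of size 2.
def joint_inference(match_net, frame_t1, frame_t2):
    src = torch.cat([frame_t2.imageL, frame_t1.imageL], dim=0)  # 2x3xHxW
    tgt = torch.cat([frame_t2.imageR, frame_t2.imageL], dim=0)  # 2x3xHxW
    match = match_net(src, tgt)     # one forward pass over the batch of 2
    # First batch entry: left-right match, whose horizontal component is
    # the disparity. Second entry: temporal match, i.e. the optical flow.
    disparity = match[0:1, 0:1]
    flow = match[1:2]
    return disparity, flow
```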
## How to use this?

- If there's no specific need (e.g. the performance optimization mentioned above), just use `FrontendCompose` to combine an `IStereoDepth` and an `IMatcher`. This should work just fine. (See the configuration sketch under Implementations below.)
- Otherwise, implement a new `IFrontend` and plug it into the pipeline.
## Interface

```python
class IFrontend(ABC, Generic[T_Context], ConfigTestableSubclass):
    @property
    @abstractmethod
    def provide_cov(self) -> tuple[bool, bool]: ...

    @abstractmethod
    def init_context(self) -> T_Context: ...

    @overload
    @abstractmethod
    def estimate(self, frame_t1: None, frame_t2: StereoData) -> tuple[IStereoDepth.Output, None]: ...

    @overload
    @abstractmethod
    def estimate(self, frame_t1: StereoData, frame_t2: StereoData) -> tuple[IStereoDepth.Output, IMatcher.Output]: ...
```
## Output Structure

The interface returns a tuple of outputs from both depth and flow estimation:

- `IStereoDepth.Output`: Depth estimation results. See the IStereoDepth documentation for details.
- `IMatcher.Output` or `None`: Flow estimation results (present only if `frame_t1` is provided). See the IMatcher documentation for details.
## Methods to Implement

- `provide_cov -> tuple[bool, bool]`
  - Property indicating whether the implementation provides uncertainty estimation
  - Returns `(depth_cov_enabled, flow_cov_enabled)`
  - Must return `True` for each component if the implementation outputs its uncertainty
- `init_context() -> T_Context`
  - Initializes model-specific context (e.g., neural networks, parameters)
  - Called during initialization
  - Access configuration via `self.config`
- `estimate(frame_t1: Optional[StereoData], frame_t2: StereoData) -> tuple[IStereoDepth.Output, Optional[IMatcher.Output]]`
  - Core method for joint depth and flow estimation
  - Input frames contain stereo image pairs (`imageL`, `imageR`) of shape B×3×H×W
  - If `frame_t1` is `None`, only performs depth estimation
  - Returns a tuple of the depth output and the optional flow output
  - May pad outputs with `nan` if the prediction shape differs from the input
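To make the contract concrete, here is a minimal skeleton of a custom implementation. It is a sketch only: the predictions are constant placeholders, it assumes `IStereoDepth.Output` and `IMatcher.Output` accept their fields as keyword arguments, and a real subclass must also satisfy the `ConfigTestableSubclass` configuration checks.

```python
import torch

class MyFrontend(IFrontend[dict]):
    @property
    def provide_cov(self) -> tuple[bool, bool]:
        # This sketch estimates uncertainty for neither depth nor flow.
        return False, False

    def init_context(self) -> dict:
        # Build networks / load weights here from self.config; a plain
        # dict stands in for a real model-specific context.
        return dict()

    def estimate(self, frame_t1, frame_t2):
        B, _, H, W = frame_t2.imageL.shape
        # Placeholder predictions; a real frontend runs its networks here.
        depth_out = IStereoDepth.Output(depth=torch.ones(B, 1, H, W))
        if frame_t1 is None:
            return depth_out, None
        flow_out = IMatcher.Output(flow=torch.zeros(B, 2, H, W))
        return depth_out, flow_out
```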
## Implementations

### Base Models

- `FrontendCompose`
  - Combines separate depth and flow estimators
  - Uses individual `IStereoDepth` and `IMatcher` implementations
  - Provides covariance if the underlying implementations do
  - Configuration:
    - `depth`: Configuration for the depth estimator
      - `type`: IStereoDepth implementation class name
      - `args`: Arguments for the depth estimator
    - `match`: Configuration for the flow estimator
      - `type`: IMatcher implementation class name
      - `args`: Arguments for the flow estimator
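Putting those keys together, a `FrontendCompose` configuration might look like the sketch below, written here as a Python dict (the estimator class names are placeholders, and whether `instantiate` takes a plain dict or a namespace-like config object depends on the config loader):

```python
# Hypothetical FrontendCompose configuration; "MyStereoDepth" and
# "MyMatcher" are placeholder class names, not actual MAC-VO estimators.
frontend_config = {
    "type": "FrontendCompose",
    "args": {
        "depth": {
            "type": "MyStereoDepth",  # an IStereoDepth implementation
            "args": {},               # its estimator-specific arguments
        },
        "match": {
            "type": "MyMatcher",      # an IMatcher implementation
            "args": {},
        },
    },
}
```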
- `FlowFormerCovFrontend`
  - The main frontend used in MAC-VO for joint depth and flow estimation
  - Uses the FlowFormer network with covariance estimation
  - Provides covariance for both depth and flow
  - Configuration:
    - `weight`: Path to model weights
    - `device`: Target device (`"cuda"` or `"cpu"`)
    - `dtype`: Model precision (`"fp32"`, `"bf16"`, or `"fp16"`)
    - `enforce_positive_disparity`: Whether to enforce positive disparity values
    - `max_flow`: Maximum allowed flow value (`-1` for no limit)
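Using the documented keys, a configuration for this frontend might look like the following sketch (the weight path and the specific values are illustrative, not mandated defaults):

```python
# A sketch of a FlowFormerCovFrontend configuration using the keys above.
frontend_config = {
    "type": "FlowFormerCovFrontend",
    "args": {
        "weight": "path/to/flowformer_cov.pth",  # placeholder path
        "device": "cuda",
        "dtype": "fp32",
        "enforce_positive_disparity": False,
        "max_flow": -1,                          # -1 disables the limit
    },
}
```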
### Accelerated Models

- `CUDAGraph_FlowFormerCovFrontend`
  - Accelerated version of FlowFormerCovFrontend using CUDA graphs
  - Improves inference speed by minimizing kernel launch overhead
  - Only available on CUDA devices
  - Same configuration as FlowFormerCovFrontend
  - Additional optimizations:
    - Uses tensor cores (TF32)
    - Reduced-precision matrix multiplication
    - CUDA solver optimization
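For context, a CUDA graph records the kernels of one forward pass and replays them as a single unit, so the per-kernel launch cost is paid only once at capture time. The sketch below shows the generic capture/replay pattern using plain PyTorch APIs; the model and input shape are hypothetical, not the actual MAC-VO frontend:

```python
import torch

# Tensor-core (TF32) matmul, as listed in the optimizations above.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Hypothetical stand-in model and fixed-shape input buffer.
model = torch.nn.Conv2d(3, 8, 3, padding=1).cuda().eval()
static_input = torch.zeros(1, 3, 480, 640, device="cuda")

with torch.no_grad():
    # Warm up on a side stream (required before capture).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Record one forward pass into the graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)

    # Replay: copy each new frame into the captured input buffer and
    # relaunch all recorded kernels with a single call.
    static_input.copy_(torch.rand(1, 3, 480, 640, device="cuda"))
    graph.replay()
    result = static_output.clone()
```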
## Usage in MAC-VO

The IFrontend interface is primarily used in:

- The visual odometry pipeline, for joint depth and flow estimation
- Evaluation and benchmarking
- Visualization and debugging

Example usage:
```python
frontend = IFrontend.instantiate(config.frontend.type, config.frontend.args)

# Depth estimation only
depth_output, _ = frontend.estimate(None, frame_t2)

# Joint depth and flow estimation
depth_output, flow_output = frontend.estimate(frame_t1, frame_t2)

# Access depth results
depth = depth_output.depth      # B×1×H×W tensor
depth_cov = depth_output.cov    # B×1×H×W tensor or None
depth_mask = depth_output.mask  # B×1×H×W tensor or None

# Access flow results (if available)
if flow_output is not None:
    flow = flow_output.flow       # B×2×H×W tensor
    flow_cov = flow_output.cov    # B×3×H×W tensor or None
    flow_mask = flow_output.mask  # B×1×H×W tensor or None
```