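The core code can be divided into the following 3 modules:
encoder: each modality is encoded separately (camera + radar)
fuser: cross-modal fusion
decoder backbone + neck: further processes the fused result into the final features passed to the head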
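The three-stage data flow above can be sketched as a small skeleton. This is only an illustrative sketch, not the repository's actual model class: `FusionPipelineSketch`, `CatFuser`, the input channel counts (80 and 32), and the single-convolution stand-ins for the encoders and decoder are all assumptions made for the example. The real encoders consume images and radar points rather than BEV-shaped tensors.

```python
from typing import Dict, List

import torch
from torch import nn


class CatFuser(nn.Module):
    """Hypothetical stand-in fuser: concatenate on channels, then one conv."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(128, 64, 3, padding=1)

    def forward(self, feats: List[torch.Tensor]) -> torch.Tensor:
        return self.conv(torch.cat(feats, dim=1))


class FusionPipelineSketch(nn.Module):
    """Hypothetical skeleton of the flow: encoders -> fuser -> decoder."""

    def __init__(self, encoders: nn.ModuleDict, fuser: nn.Module, decoder: nn.Module) -> None:
        super().__init__()
        self.encoders = encoders
        self.fuser = fuser
        self.decoder = decoder

    def forward(self, inputs: Dict[str, torch.Tensor]) -> torch.Tensor:
        # 1) encoder: each modality is encoded on its own into a BEV feature map
        features = [self.encoders[name](inputs[name]) for name in self.encoders]
        # 2) fuser: merge the per-modality BEV features into one map
        x = self.fuser(features)
        # 3) decoder (backbone + neck): refine the fused map for the head
        return self.decoder(x)


# Stand-in modules with made-up channel counts, used only to run the sketch.
model = FusionPipelineSketch(
    encoders=nn.ModuleDict({
        "camera": nn.Conv2d(80, 64, 1),
        "radar": nn.Conv2d(32, 64, 1),
    }),
    fuser=CatFuser(),
    decoder=nn.Conv2d(64, 64, 3, padding=1),
)
out = model({
    "camera": torch.randn(1, 80, 16, 16),
    "radar": torch.randn(1, 32, 16, 16),
})
```

The only contract the sketch relies on is that every encoder emits a BEV map with the same spatial size, so the fuser can concatenate them along the channel dimension.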
encoder: each modality is encoded separately (camera + radar)
Camera encoding:
Radar encoding:
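fuser: cross-modal fusion
Multimodal fusion
x = self.fuser(features) fuses the camera BEV feature and the radar BEV feature held in the features list into a single, unified BEV feature map.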
class ConvFuser(nn.Sequential):
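Inputs:
features: a list of length 2
features[0]: camera BEV, shape [B, 64, H, W]
features[1]: radar BEV, shape [B, 64, H, W]
Outputs:
x: shape [B, 64, H, W]
Implementation:
step1: concatenate along the channel dimension, z = cat(features, dim=1), shape [B, 128, H, W]
step2: apply a 3x3 convolution (128 -> 64) with padding 1 so H and W are preserved, followed by BatchNorm and ReLU, as shown below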
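from typing import List

import torch
from torch import nn


class ConvFuser(nn.Sequential):
    def __init__(self, in_channels: List[int], out_channels: int) -> None:
        # in_channels holds one channel count per modality, e.g. [64, 64]
        self.in_channels = in_channels
        self.out_channels = out_channels
        super().__init__(
            nn.Conv2d(sum(in_channels), out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(True),
        )

    def forward(self, inputs: List[torch.Tensor]) -> torch.Tensor:
        # Concatenate along the channel dimension, then apply conv -> BN -> ReLU
        return super().forward(torch.cat(inputs, dim=1))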
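To check the shapes claimed in step1 and step2, the two steps can be traced by hand with random tensors. This is a minimal sketch: the batch size and BEV grid size (`B`, `H`, `W`) are arbitrary values picked for illustration, and the layers are built inline rather than through `ConvFuser` so that each intermediate shape is visible.

```python
import torch
from torch import nn

# Hypothetical sizes, chosen only for illustration.
B, H, W = 2, 8, 8
camera_bev = torch.randn(B, 64, H, W)
radar_bev = torch.randn(B, 64, H, W)

# step1: concatenate along the channel dimension -> [B, 128, H, W]
z = torch.cat([camera_bev, radar_bev], dim=1)
assert z.shape == (B, 128, H, W)

# step2: 3x3 conv (128 -> 64, padding=1 keeps H and W), then BN and ReLU
fuse = nn.Sequential(
    nn.Conv2d(128, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(True),
)
x = fuse(z)
assert x.shape == (B, 64, H, W)
```

The convolution omits its bias because the BatchNorm layer that follows has its own learnable shift, which makes a separate convolution bias redundant.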
decoder backbone + neck: further processes the fused result into the final features passed to the head