https://arxiv.org/pdf/2211.15654.pdf
Pre-trained text and image embedding models → 3D scene understanding
Overview
3.1. Image Feature Fusion
Segmentation model:
LSeg, OpenSeg
Input: RGB image with resolution H × W
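A minimal shape sketch of this step, assuming the segmentation model returns one embedding vector per pixel. The notes above only state the input, so the output shape is an assumption here. `dummy_pixel_encoder` and `feat_dim` are hypothetical stand-ins, not the LSeg or OpenSeg API: the function uses a fixed random projection purely to show how an H × W RGB image would map to an H × W × C feature map.

```python
import numpy as np


def dummy_pixel_encoder(image: np.ndarray, feat_dim: int = 8, seed: int = 0) -> np.ndarray:
    """Hypothetical stand-in for a frozen segmentation model such as LSeg or OpenSeg.

    Takes an RGB image of shape (H, W, 3) and returns a per-pixel feature map
    of shape (H, W, feat_dim). A real model would run a neural network here;
    this sketch only applies a fixed random linear projection.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an RGB image of shape (H, W, 3)")
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((3, feat_dim))  # assumed, not learned
    scaled = image.astype(np.float32) / 255.0        # bring pixel values into (0, 1]
    features = scaled @ projection                   # (H, W, 3) @ (3, C) -> (H, W, C)
    # L2-normalise each pixel's feature vector (a common choice for embeddings).
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, 1e-8)


# Input: RGB image with resolution H x W (tiny synthetic example).
H, W = 4, 6
rgb = np.random.default_rng(1).integers(1, 256, size=(H, W, 3), dtype=np.uint8)
feats = dummy_pixel_encoder(rgb)
print(rgb.shape, "->", feats.shape)
```

The point of the sketch is only the shape contract: every pixel of the H × W input gets its own C-dimensional vector, which is what a later fusion step would need in order to attach image features to 3D points.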