I am trying to use the detectron2 framework to extract region features for detections whose class score is higher than some threshold. After following this link (Detectron2 - Extract region features at a threshold for object detection), I successfully extracted image features for object detection. I use the code below to extract features from an FPN model, with the config "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml".
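For completeness, this is roughly how I build the predictor (a minimal sketch using detectron2's standard model zoo and DefaultPredictor setup; the score threshold of 0.5 is an example value I chose):

import cv2
import torch
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Build Mask R-CNN R-101 FPN from the model zoo
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # detection score threshold
predictor = DefaultPredictor(cfg)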
Feature extraction code:
img_path = "input.jpg"
img_ori = cv2.imread(img_path)
height, width = img_ori.shape[:2]
# Apply the same resizing transform the predictor uses
img = predictor.transform_gen.get_transform(img_ori).apply_image(img_ori)
img = torch.as_tensor(img.astype("float32").transpose(2, 0, 1))  # HWC -> CHW
inputs = [{"image": img, "height": height, "width": width}]
with torch.no_grad():
    imglists = predictor.model.preprocess_image(inputs)  # don't forget to preprocess (normalize + pad)
    features = predictor.model.backbone(imglists.tensor)  # dict of FPN feature maps (p2..p6)
    proposals, _ = predictor.model.proposal_generator(imglists, features, None)  # RPN proposals
    proposal_boxes = [x.proposal_boxes for x in proposals]
    features_list = [features[f] for f in predictor.model.roi_heads.in_features]
    proposal_rois = predictor.model.roi_heads.box_pooler(features_list, proposal_boxes)  # ROIAlign per proposal
    box_features = predictor.model.roi_heads.box_head(proposal_rois)  # per-region features
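To check the output, I print the shape (the first dimension is the number of proposals kept after NMS, which defaults to cfg.MODEL.RPN.POST_NMS_TOPK_TEST = 1000):

print(box_features.shape)  # e.g. torch.Size([1000, 1024])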
Question: I use box_features as the feature for each detected object. But its dimension is 1024, which is inconsistent with the original bottom-up-attention (https://github.com/peteanderson80/bottom-up-attention) image features, which have dimension 2048. Both use ResNet-101 as the backbone network, so why are the feature dimensions inconsistent?
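For reference, printing the box head shows where the 1024 comes from in my environment (the exact module names and repr may differ across detectron2 versions):

print(predictor.model.roi_heads.box_head)
# Prints something like:
# FastRCNNConvFCHead(
#   (flatten): Flatten(start_dim=1, end_dim=-1)
#   (fc1): Linear(in_features=12544, out_features=1024, bias=True)
#   (fc_relu1): ReLU()
#   (fc2): Linear(in_features=1024, out_features=1024, bias=True)
#   (fc_relu2): ReLU()
# )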