U-Net: how to understand the cropped output

Question

I'm looking for a U-Net implementation for a landmark detection task, where the architecture is intended to be similar to the figure above. For reference, please see: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms

From the figure, we can see that the input dimension is 572x572 but the output dimension is 388x388. My question is: how do we visualize and correctly understand the cropped output? As far as I know, we ideally expect the output size to be the same as the input size (572x572) so that we can apply the mask to the original image to carry out segmentation. However, in some tutorials like (this one), the author recreates the model from scratch and uses "same padding" to sidestep this issue, but I would prefer not to use same padding to achieve the same output size.
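To make the 572 to 388 shrinkage concrete, here is a minimal sketch of the size arithmetic. It assumes the layout usually drawn in the original U-Net figure (four pooling stages, two unpadded 3x3 convolutions per block, 2x up-convolutions); the function name `unet_valid_output_size` is made up for illustration.

```python
def unet_valid_output_size(size, depth=4, convs_per_block=2, kernel=3):
    """Trace the spatial size through a U-Net built from unpadded convolutions."""
    # Each unpadded 3x3 convolution removes 2 pixels (1 per border).
    shrink = convs_per_block * (kernel - 1)
    skips = []
    # Contracting path: conv block, then 2x2 max pooling.
    for _ in range(depth):
        size -= shrink
        skips.append(size)
        if size % 2:
            raise ValueError("odd size before pooling; pick another input size")
        size //= 2
    size -= shrink  # bottleneck block
    # Expanding path: 2x up-convolution, then conv block.
    crops = []
    for skip in reversed(skips):
        size *= 2
        crops.append((skip, size))  # skip map must be cropped from `skip` to `size`
        size -= shrink
    return size, crops


out, crops = unet_valid_output_size(572)
print(out)    # 388
print(crops)  # [(64, 56), (136, 104), (280, 200), (568, 392)]
```

Under these assumptions the 184 missing pixels (92 per side) are simply the border consumed by the unpadded convolutions, so the 388x388 output corresponds to the central region of the 572x572 input.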

I can't use same padding because I chose a pretrained ResNet34 as my encoder backbone. As far as I can tell, PyTorch's pretrained ResNet34 implementation doesn't use same padding in the encoder, which means the result is just like what you see in the figure above (intermediate feature maps are cropped before being copied). If I were to continue building the decoder this way, the output would be smaller than the input image.

The question is: if I want to use the output segmentation maps, should I pad them on the outside until their dimensions match the input, or should I just resize the maps? I'm worried that the first will lose information about the image boundary and that the second will dilate the landmark predictions. Is there a best practice for this?

The reason I must use a pretrained network is that my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from what it learned on ImageNet.

Hi, I think I have a similar question. For example, for the first grey line, how do you crop a (64, 568, 568) (C, W, H) feature map to (64, 392, 392)? Did the author use a conv, or crop the feature map directly in each channel? What is the method in detail? — 4daJKong, Apr 18 '23 at 07:55
@4daJKong It has been a long time and I can't remember it well; I implemented it using the architecture I described below. But I guess resizing the map from 568 to 392 by interpolation could work well. — kelvin hong 方, Apr 18 '23 at 11:01
Many thanks for your reply again, but the problem is decreasing the size of the feature map from 568 to 392. I thought interpolation meant increasing the size. — 4daJKong, Apr 19 '23 at 09:23
@4daJKong Hi, interpolation can also be used when downsizing, please see if this helps: https://stackoverflow.com/questions/875856/interpolation-algorithms-when-downscaling . I did mean downsizing the feature map from 568 to 392. — kelvin hong 方, Apr 20 '23 at 02:21
But could you please give a simple demo? I tried to use `torch.nn.functional.interpolate(x, size=(2,2))` to downscale a tensor `x = torch.randint(1,16,(4,4))` — 4daJKong, Apr 23 '23 at 03:01
https://stackoverflow.com/questions/58676688/how-to-resize-a-pytorch-tensor You can refer to this! — kelvin hong 方, Apr 23 '23 at 10:48
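
For the two options discussed in the comments above, here is a minimal sketch, not the paper authors' code. It assumes the skip connection is handled either by a plain center crop (the usual reading of the U-Net figure) or by interpolation, as suggested in the comments; `center_crop` is a hypothetical helper. It also shows why the `(4, 4)` integer tensor needs reshaping first: with `mode="bilinear"`, `F.interpolate` expects a floating-point tensor shaped (N, C, H, W).

```python
import torch
import torch.nn.functional as F


def center_crop(feat, size):
    """Cut a size x size window from the middle of an (N, C, H, W) tensor."""
    _, _, h, w = feat.shape
    top, left = (h - size) // 2, (w - size) // 2
    return feat[:, :, top:top + size, left:left + size]


skip = torch.randn(1, 64, 568, 568)

# Option 1: direct slicing of every channel, no convolution involved.
cropped = center_crop(skip, 392)

# Option 2: downscale the whole map by interpolation.
resized = F.interpolate(skip, size=(392, 392), mode="bilinear", align_corners=False)
print(cropped.shape, resized.shape)

# The small example from the comments: cast to float and add
# batch and channel dimensions, (4, 4) -> (1, 1, 4, 4).
x = torch.randint(1, 16, (4, 4)).float()[None, None]
small = F.interpolate(x, size=(2, 2), mode="bilinear", align_corners=False)
print(small.shape)
```

Cropping keeps the original pixel scale but discards the border, while interpolation keeps the full field of view at a coarser scale, so the two are not interchangeable.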

score 0 · Answer 1 · answered Dec 04 '21 at 11:32

After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose image size to unpadded convolutions; its implementation does in fact use same-style padding, so the spatial size only changes at the strided and pooling layers. An illustration is

  Input(3,512,512) -> Layer1(64,128,128) -> Layer2(128,64,64) -> Layer3(256,32,32) -> Layer4(512,16,16)

so we can use deconvolution (ConvTranspose2d in PyTorch) to bring the spatial dimension back up to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps) at the input resolution.
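
A minimal sketch of that idea, under stated assumptions: the class name `SimpleDecoder`, the channel widths, the kernel size of 2, and `num_maps=3` are all illustrative choices, not the exact architecture I used. It has no skip connections, and a random tensor stands in for the Layer4 output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDecoder(nn.Module):
    """Hypothetical decoder: (512, 16, 16) features -> (num_maps, 512, 512) maps."""

    def __init__(self, num_maps=1):
        super().__init__()
        chans = [512, 256, 128, 64]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # kernel 2, stride 2 doubles the spatial size: 16 -> 32 -> 64 -> 128
            layers += [nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*layers)
        self.head = nn.Conv2d(chans[-1], num_maps, kernel_size=1)

    def forward(self, feat):
        x = self.head(self.up(feat))  # (N, num_maps, 128, 128)
        # Upsample 4x to reach the 512x512 input resolution.
        return F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)


feat = torch.randn(1, 512, 16, 16)  # stands in for the Layer4 output
out = SimpleDecoder(num_maps=3)(feat)
print(out.shape)
```

In a real model the intermediate encoder outputs (Layer1 to Layer3) could be concatenated into the decoder as skip connections; since the shapes already match, no cropping would be needed.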
