
Say we have a single-channel image (5x5)

A = [ 1 2 3 4 5
      6 7 8 9 2
      1 4 5 6 3
      4 5 6 7 4
      3 4 5 6 2 ]

And a filter K (2x2)

K = [ 1 1
      1 1 ]

An example of applying convolution (let us take the first 2x2 patch from A) would be

1*1 + 2*1 + 6*1 + 7*1 = 16
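
For concreteness, a minimal NumPy sketch (not part of the original question) of this sliding-window sum over all of A:

    import numpy as np

    A = np.array([[1, 2, 3, 4, 5],
                  [6, 7, 8, 9, 2],
                  [1, 4, 5, 6, 3],
                  [4, 5, 6, 7, 4],
                  [3, 4, 5, 6, 2]])
    K = np.ones((2, 2))

    out = np.zeros((4, 4))          # output is (5 - 2 + 1) x (5 - 2 + 1)
    for i in range(4):
        for j in range(4):
            out[i, j] = (A[i:i+2, j:j+2] * K).sum()

    print(out[0, 0])                # 16.0, matching 1*1 + 2*1 + 6*1 + 7*1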

This is very straightforward. But let us introduce a depth factor to matrix A, i.e., an RGB image with 3 channels, or even conv layers in a deep network (with depth = 512, maybe). How would the convolution operation be done with the same filter? A similar worked example for the RGB case would be really helpful.

– Teymour, Aragorn

4 Answers


Let's say we have a 3-channel (RGB) image given by some matrix A


    A = [[[198 218 227]
          [196 216 225]
          [196 214 224]
          ...
          ...
          [185 201 217]
          [176 192 208]
          [162 178 194]]]

and a blur kernel as


    K = [[0.1111, 0.1111, 0.1111],
         [0.1111, 0.1111, 0.1111],
         [0.1111, 0.1111, 0.1111]]

    #which is actually 0.1111 ~= 1/9

The convolution can be represented as shown in the image below.

[image: convolution of the RGB channels]

As you can see in the image, each channel is convolved individually, and the three results are then combined to form a pixel.
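
A minimal NumPy sketch of this per-channel blur (illustrative only; a real implementation would vectorize the loops or use a library routine):

    import numpy as np

    K = np.full((3, 3), 1.0 / 9.0)      # the averaging kernel from above

    def blur_rgb(A):
        # A is (H, W, 3); the *same* kernel is applied to each channel,
        # and the three filtered channels are stacked back into an RGB image
        H, W, _ = A.shape
        out = np.zeros((H - 2, W - 2, 3))
        for c in range(3):
            for i in range(H - 2):
                for j in range(W - 2):
                    out[i, j, c] = (A[i:i+3, j:j+3, c] * K).sum()
        return out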

– Muthukrishnan
  • This is how the blurring operation works. In CNN convolution, the kernel weights for each channel are different, and we add the 3 channels together to produce a single-channel output. In order to produce m output channels, we need m 3*3 filters with different weights in each kernel. – Onkar Chougule Dec 24 '21 at 09:38
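
A compact sketch of the point in the comment above (m = 2 output channels is an assumed value): each output channel gets its own 3*3*3 kernel, and each kernel's three per-channel responses are summed into a single map:

    import numpy as np

    m = 2                                 # number of output channels (assumed)
    A = np.random.rand(5, 5, 3)           # RGB-like input
    filters = np.random.rand(m, 3, 3, 3)  # m kernels, different weights per channel

    out = np.zeros((3, 3, m))
    for k in range(m):
        for i in range(3):
            for j in range(3):
                # each channel is multiplied by its own 3x3 slice, then all summed
                out[i, j, k] = (A[i:i+3, j:j+3, :] * filters[k]).sum()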

It is done just the same as with a single-channel image, except that you will get three matrices instead of one. This is a lecture note about CNN fundamentals, which I think might be helpful for you.
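
Taken literally, a small sketch of this (mine, not from the lecture notes), reusing the 2x2 filter from the question: convolving each channel separately yields three result matrices:

    import numpy as np

    A = np.random.rand(5, 5, 3)     # 3-channel input
    K = np.ones((2, 2))             # the 2x2 filter from the question

    # one 4x4 result per channel -- the "three matrices instead of one"
    results = np.zeros((4, 4, 3))
    for c in range(3):
        for i in range(4):
            for j in range(4):
                results[i, j, c] = (A[i:i+2, j:j+2, c] * K).sum()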

– Lifu Huang
  • Hi, when you say 3 matrices, do you mean that you take a filter, compute its dot product with the first matrix, add the filter's dot product with the second matrix, and then add the filter's dot product with the third matrix? This would then give you a single value for that location. Am I correct? – Ninja Dude Mar 06 '18 at 09:49
  • Has the question in the comments been confirmed? – Jonathan Aug 06 '18 at 20:35
  • **Beware of the difference** in convolutions for CNNs and for image pre-processing (like Gaussian blur)! The former apply a 'deep' kernel (with *different* filters for each channel), then effectively sum up the output matrices (along with a bias term) to yield a single-channel feature map. Whereas 'blurring' an RGB image yields the filtered RGB image back by applying the *same* filter to each channel and nothing more. – Alaroff Sep 09 '18 at 12:01
  • @Desmond Yes, you are correct: you will get a single value for that location. But most probably, instead of taking the dot product of each channel with the same filter, you will train three different "filters", one per channel (which can also be viewed as training one three-dimensional filter M x N x D, where D is 3 for RGB images). – Lifu Huang Sep 10 '18 at 18:18
  • I found this answer difficult to understand, but the linked lecture notes are excellent. – craq Sep 17 '19 at 03:07
  • @NinjaDude I am wondering the same thing: **do you mean that you take a filter a...**? If someone could confirm whether that is so :) – Sayan Dey Jul 09 '21 at 15:12
  • This is the same lecture video: https://youtu.be/bNb2fEVKeEo?t=1311 At this point, the 3 RGB channels are somehow squashed into a single channel on the right side. Not sure if they mention this in the video, but they conveniently skipped it in the ppt. My guess is that it's just the sum of all 3 values to get a single value, like Alaroff mentioned. – theprogrammer Nov 30 '22 at 01:02

In a convolutional neural network, the convolution operation is implemented as follows (note: convolution in a blur/filter operation is a separate case):

For RGB-like inputs, the filter is actually 2*2*3; each 2*2 slice of the filter corresponds to one color channel, resulting in three filter responses. These three add up to one value, followed by bias and activation. Finally, this is one pixel in the output map.
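
A hedged sketch of that single-pixel computation (the ReLU activation and the bias value are assumptions; the answer does not name them):

    import numpy as np

    patch = np.random.rand(2, 2, 3)   # one 2x2 RGB patch of the input
    W = np.random.rand(2, 2, 3)       # filter weights: one 2x2 slice per channel
    b = 0.1                           # bias term (illustrative value)

    responses = (patch * W).sum(axis=(0, 1))   # one response per channel
    pixel = max(0.0, responses.sum() + b)      # add the three up, bias, ReLU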

– shantanu pathak, flankechen

If you're trying to implement a Conv2d on an RGB image, this PyTorch implementation should help.

Grab an image and make it a numpy ndarray of uint8 (note that imshow needs uint8 values between 0-255, whilst floats should be between 0-1):

import requests, torch, numpy as np
import torch.nn as nn, matplotlib.pyplot as plt
from io import BytesIO
from PIL import Image

link = 'https://oldmooresalmanac.com/wp-content/uploads/2017/11/cow-2896329_960_720-Copy-476x459.jpg'

r = requests.get(link, timeout=7)
im = Image.open(BytesIO(r.content))
pic = np.array(im)

You can view it with

f, axarr = plt.subplots()
axarr.imshow(pic)
plt.show()

Create your convolution layer (it is initialized with random weights)

conv_layer = nn.Conv2d(in_channels=3, out_channels=3,
                       kernel_size=3, stride=1, bias=False)

Convert the input image to float and add a batch dimension, because that is the input shape PyTorch expects

pic_float = np.float32(pic)
pic_float = np.expand_dims(pic_float,axis=0)

Run the image through the convolution layer (permute reorders the dimensions so they match the NCHW layout PyTorch expects)

out = conv_layer(torch.tensor(pic_float).permute(0,3,1,2))

Remove the extra batch dim we added (not needed for visualization), detach from the computation graph, and convert to a numpy ndarray

out = out.permute(0,2,3,1).detach().numpy()[0, :, :, :]

Visualise the output (with a cast back to uint8, which is what we started with)

f, axarr = plt.subplots()
axarr.imshow(np.uint8(out))
plt.show()

You can then change the weights of the filters by accessing them. For example:

kernel = torch.Tensor([[[[0.01, 0.02, 0.01],
                         [0.02, 0.04, 0.02],
                         [0.01, 0.02, 0.01]]]])

kernel = kernel.repeat(3, 3, 1, 1)  # reuse the same 3x3 kernel for every in/out channel pair
conv_layer.weight.data = kernel
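
Re-running the forward pass then applies these hand-set weights; since every output channel now mixes all three input channels with the same smoothing kernel, the result is a blurred (and channel-mixed) image:

out = conv_layer(torch.tensor(pic_float).permute(0, 3, 1, 2))
out = out.permute(0, 2, 3, 1).detach().numpy()[0]

f, axarr = plt.subplots()
axarr.imshow(np.uint8(out))
plt.show()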
– Gal_M