I am working on the Plant Seedlings dataset on Kaggle and have prepared a dataframe with 2 columns.

The first column holds the path of each image in the train set, and the second column holds the label (class name) of that image.

I want to convert it into a dataframe that I can then use to train my model.

Also, each image has 3 channels.

The dataframe that holds the path and label is named arr:

                              file               category
0        ../input/train/Maize/a5c2eec2d.png        Maize
1        ../input/train/Maize/8cd93b279.png        Maize
2        ../input/train/Maize/8c6fba454.png        Maize
3        ../input/train/Maize/abadd72ab.png        Maize
4        ../input/train/Maize/f60369038.png        Maize

How should I do this?

Bharath M Shetty
asn
  • 1. `from PIL import Image`, 2. `df['image'] = df['file'].apply(lambda x: np.asarray(Image.open(x)))`. Hope this helps – Bharath M Shetty Sep 23 '18 at 05:07
  • @Dark Will df['image'] be able to work with the RGB channels? – asn Sep 23 '18 at 05:10
  • Each cell of df['image'] will now hold a 3-D array with dimensions (:,:,3). In other words, the layers along the third dimension correspond to the R, G and B channels. – Bharath M Shetty Sep 23 '18 at 05:12
  • @Dark I think the above dataframe will consume a huge amount of memory. Is there any way to use less memory? – asn Sep 23 '18 at 05:16
  • Yes, you can use a `for` loop and append the images to an empty list. Something like `for i in df['file']: imgs.append(np.asarray(Image.open(i)))`. This is usually the preferred way, since you can also apply multiprocessing to load the images faster. – Bharath M Shetty Sep 23 '18 at 05:18
  • Actually, when I print the shape of the df['image'] column, it gives me `(943, 943, 3)`. Is there any way to convert it to, say, `(224,224,3)`? I want the third column to consume less memory. – asn Sep 23 '18 at 05:20
  • Refer to this: https://stackoverflow.com/questions/273946/how-do-i-resize-an-image-using-pil-and-maintain-its-aspect-ratio. It might help. There are also kernels on Kaggle that show how to load images efficiently. Do search for them :) – Bharath M Shetty Sep 23 '18 at 05:23
  • @Jacob You can resize the image to (224,224,3) using the resize function: `for i in df['file']: imgs.append(np.asarray(Image.open(i).resize((224,224))))`. – Jagadeesh Dondeti Sep 23 '18 at 06:02

1 Answer

from PIL import Image
import numpy as np

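# df is the dataframe from the question (called arr there)
dataset = []
# If you want to encode the category names as integers, you can do the following
# df['category_code'] = df['category'].astype('category').cat.codes
# and then iterate over df['category_code'] in the for loop instead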
for image_name, category in zip(df['file'],df['category']):
    image = np.asarray(Image.open(image_name))
    dataset.append((image,category))

To resize an image to a particular size:

image = np.asarray(Image.open(image_name).resize(size))

where size is a (width, height) tuple such as (224,224).
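
Putting the loop, the resize and the label encoding together, a minimal sketch could look like the following. The function name `load_images` and its return layout are assumptions made for illustration, not part of the original answer. It also assumes a dataframe with the `file` and `category` columns from the question, and that converting every image to RGB is acceptable.

```python
import numpy as np
import pandas as pd
from PIL import Image


def load_images(df, size=(224, 224)):
    """Hypothetical helper: build (X, y, class_names) from the 'file' and 'category' columns."""
    # Encode each label as an integer; class_names[i] is the label for code i.
    codes, class_names = pd.factorize(df['category'])

    # PIL sizes are (width, height), while array shapes are (height, width, channels).
    # uint8 stores one byte per channel value.
    X = np.empty((len(df), size[1], size[0], 3), dtype=np.uint8)
    for i, path in enumerate(df['file']):
        with Image.open(path) as img:
            # convert('RGB') ensures every image ends up with exactly 3 channels.
            X[i] = np.asarray(img.convert('RGB').resize(size))
    return X, codes, class_names
```

With the dataframe from the question this would be called as `X, y, class_names = load_images(arr)`. Stored as uint8, a 224x224x3 image takes 150,528 bytes, compared with about 2.67 MB for a 943x943x3 one, which addresses the memory concern raised in the comments.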