
I have a dataset of 500k+ images. All images are in a single folder. The labels are in a CSV file where column 1 is the filename and column 2 is the class.

I know there is .flow_from_dataframe, but I need to make additional changes to the data before using it in a CNN. I have not been able to find a way of converting the resulting DataFrameIterator object into either a NumPy array or a pandas DataFrame. If someone knows how to do that, it would solve my issue.

My alternative is to load the images and the labels into two separate dataframes and then merge them on the image filename.

I've used this but can't figure out how to also add the filenames:

import glob

import cv2

cv_img = []
for img in glob.glob("foo_test/*.jpg"):
    n = cv2.imread(img)
    cv_img.append(n)

I've also used this:

import os
from PIL import Image

path = 'foo_test'
images = [f for f in os.listdir(path) if os.path.splitext(f)[-1] == '.jpg']

for image in images:
    Image.open(image)

which gives me:

FileNotFoundError: [Errno 2] No such file or directory: '0.jpg'

although 0.jpg is very much there. I don't see why that would fail.

I've gone through over 30 posts on Stack Overflow and can't find a simple way of doing this.

My only other alternative is to move each image into a separate folder named after its class and then load the images by folder name, for which I already have code that works. But with half a million images that's just not feasible.

Anyone have a better idea?

Alex03
  • Make sure `Image.open` is passed an *absolute* path to the file you want to open. – Scott Hunter Jul 20 '21 at 12:18
  • It is, same error. All I want to achieve is either a pandas df or numpy array in the format of [pixel values], filename so I can then merge it with the 2nd df containing the filenames and classes. Is there a better way of doing this? – Alex03 Jul 20 '21 at 14:26

1 Answer

I've found a solution and am closing the question.

For anyone in need:

import csv
import os

import cv2

# Importing the csv into a dictionary mapping filename -> class
with open('your_labels_csv_file.csv', mode='r', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    gt = {rows[0]: rows[1] for rows in reader}


# Importing the dataset

# setting path and list variables
path = 'your_dataset_path/'
images = []
target = []

# loading dataset
for root, dirs, files in os.walk(path):
    for file in files:
        # os.path.join builds the full path on any OS; flag 0 loads the image as grayscale
        im = cv2.imread(os.path.join(root, file), 0)
        images.append(im)
        target.append(gt[file])
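
For reference, the FileNotFoundError in the question comes from os.listdir returning bare filenames such as '0.jpg', which Image.open then resolves against the current working directory rather than against foo_test. The sketch below shows one way to keep the filename next to each image while joining the folder back onto the path. The function names (read_labels, load_labelled_images) and the read_image parameter are made up for illustration, and it assumes a two-column CSV without a header row and a flat folder of .jpg files.

```python
import csv
import os


def read_labels(csv_path):
    """Map filename -> class from a two-column CSV (assumed to have no header row)."""
    with open(csv_path, mode="r", newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}


def _read_with_cv2(full_path):
    # Imported lazily so the rest of the sketch works without OpenCV installed.
    import cv2
    return cv2.imread(full_path, cv2.IMREAD_GRAYSCALE)


def load_labelled_images(folder, labels, read_image=_read_with_cv2):
    """Return parallel lists (filenames, images, targets) for the .jpg files in folder."""
    filenames, images, targets = [], [], []
    for name in sorted(os.listdir(folder)):
        if os.path.splitext(name)[1].lower() != ".jpg":
            continue
        if name not in labels:
            continue  # skip unlabelled images instead of raising KeyError
        # os.listdir returns bare names, so the folder has to be joined back on.
        image = read_image(os.path.join(folder, name))
        if image is None:
            continue  # cv2.imread returns None for files it cannot decode
        filenames.append(name)
        images.append(image)
        targets.append(labels[name])
    return filenames, images, targets
```

It could be called as labels = read_labels('your_labels_csv_file.csv') followed by filenames, images, targets = load_labelled_images('foo_test', labels). Because the three lists stay aligned by index, they can be turned into a dataframe or merged on filename afterwards.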

Closed.

Alex03