I have a dataset of 500k+ images. All images are in a single folder. The labels are in a csv file with column1 = filename and column2 = class.
I know about .flow_from_dataframe, but I need to make additional changes to the data before using it in a CNN. I have not found a way to convert the DataFrameIterator object it returns into either a NumPy array or a pandas DataFrame. If someone knows how, that would solve my problem.
My alternative is to load the images and the labels into two separate dataframes and then merge them on the image filename.
I've used this but can't figure out how to also add the filenames:
import glob

import cv2

cv_img = []
for img in glob.glob("foo_test/*.jpg"):
    # cv2.imread returns the image as a NumPy array, or None if it cannot be read
    n = cv2.imread(img)
    cv_img.append(n)
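One way to keep each filename next to its label is to pair paths with the CSV before loading any pixels. The following is only a sketch under assumptions: the helper name `pair_paths_with_labels` is hypothetical, it assumes the CSV has the filename in column 1 and the class in column 2, and it uses a plain dict lookup in place of a dataframe merge.

```python
import csv
import glob
import os


def pair_paths_with_labels(image_dir, labels_csv, pattern="*.jpg"):
    """Return a list of (path, filename, label) for images that have a label.

    Hypothetical helper: assumes column 1 of the CSV is the filename and
    column 2 is the class, as described in the question.
    """
    # Build a filename -> class lookup from the CSV. A header row, if present,
    # ends up as an entry that matches no image file and is therefore ignored.
    with open(labels_csv, newline="") as f:
        labels = {row[0]: row[1] for row in csv.reader(f) if len(row) >= 2}

    records = []
    # glob.escape protects against special characters in the directory name.
    search = os.path.join(glob.escape(image_dir), pattern)
    for path in sorted(glob.glob(search)):
        name = os.path.basename(path)  # bare filename, used as the join key
        if name in labels:
            records.append((path, name, labels[name]))
    return records
```

Each `path` in the result can then be passed to `cv2.imread`, so the image, its filename, and its class stay together. With 500k+ images it is probably safer to read them in batches from this list than to hold every decoded image in memory at once.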
I've also used this:
import os
from PIL import Image

path = 'foo_test'
images = [f for f in os.listdir(path) if os.path.splitext(f)[-1] == '.jpg']
for image in images:
    Image.open(image)
which gives me:
FileNotFoundError: [Errno 2] No such file or directory: '0.jpg'
even though 0.jpg definitely exists in that folder. I don't see why this fails.
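A likely cause, assuming the script is run from a directory other than foo_test: `os.listdir` returns bare file names such as `0.jpg`, not paths, so opening `0.jpg` looks in the current working directory instead of in foo_test. The sketch below joins the folder back onto each name; the helper name `list_image_paths` is hypothetical.

```python
import os


def list_image_paths(folder, ext=".jpg"):
    """Return full paths of the files in `folder` that have extension `ext`.

    Hypothetical helper. Unlike the snippet above, the comparison is
    case-insensitive, so '.JPG' files are included as well.
    """
    return [
        os.path.join(folder, name)  # prepend the folder to the bare file name
        for name in sorted(os.listdir(folder))
        if os.path.splitext(name)[-1].lower() == ext
    ]
```

Each returned path can be passed to `Image.open` directly, for example `Image.open(p)` for every `p` in `list_image_paths('foo_test')`, regardless of the directory the script is started from.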
I've gone through over 30 posts on Stack Overflow and can't find a simple way of doing this.
My only other alternative is to move each image into a separate folder named after its class and then load the images by folder name, for which I already have code that works. But with half a million images that's just not feasible.
Anyone have a better idea?