How to convert and organize RGB images of different dimensions into a CSV file?

Question

I have approximately 300k .jpg images in my dataset, but the images have different dimensions. I want to convert the RGB channels of all the images into a .csv file, but what should I write to the empty cells? I could put an 'N' character there, but I want to organize the .csv file with numpy and a DataFrame. Any ideas? (The dataset is for creating a deep learning model.)

Are you doing image processing (you can use [Pillow](https://pillow.readthedocs.io/en/stable/) for that) or metadata processing? — Laurent LAPORTE, Feb 06 '19 at 12:27
I'm not doing image processing; the dataset is for creating a deep learning model. — ATES, Feb 06 '19 at 12:30
I guess you could resize them all to a common size easily enough... I hope you have lots of disk for this inefficient storage technique! — Mark Setchell, Feb 06 '19 at 12:32
They could be resized to a common size, but in that case much of the data may be lost, so it isn't a solution. — ATES, Feb 06 '19 at 12:36

score 0 · Accepted Answer · answered Feb 06 '19 at 13:45

This started as a comment but got too long. I think the answer depends on what you want the code to do when a value is missing.

If a pixel is empty, filling it with white (255,255,255) or black (0,0,0) may be the least invasive option for the deep learning model (you'd need to look into how the model handles this). In my own experience, stretching/scaling the images to a common size was actually the best way to go.
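As a rough illustration of the padding option, here is a minimal sketch that assumes each image is already loaded as an (H, W, 3) uint8 numpy array. The helper name `pad_to_shape` and the choice to pad only on the bottom and right are my own assumptions, not a standard API.

```python
import numpy as np

def pad_to_shape(image, target_height, target_width, fill_value=0):
    """Pad an (H, W, 3) array on the bottom and right up to the target size."""
    height, width = image.shape[:2]
    if height > target_height or width > target_width:
        raise ValueError("image is larger than the target size")
    # No padding on the channel axis, only on rows and columns.
    pad_spec = ((0, target_height - height), (0, target_width - width), (0, 0))
    return np.pad(image, pad_spec, mode="constant", constant_values=fill_value)

# A tiny 2x3 stand-in image padded with black to 4x5.
small = np.full((2, 3, 3), 200, dtype=np.uint8)
padded = pad_to_shape(small, 4, 5, fill_value=0)
```

Passing `fill_value=255` would pad with white instead.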

Just writing empty records (an empty string or a gap between commas) in the CSV is an option, see this answer. If you're using numpy.genfromtxt to read the data, you can then set missing_values and filling_values as needed. You could also make up a sentinel value that would never occur naturally, like 99999 or DEADBEEF, to mark empty records, and write code to parse them as needed.
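A small sketch of the empty-record approach, assuming one image row per CSV line and using an in-memory string in place of a real file. The 'N' marker from the question is included as well; the sample values are made up for illustration.

```python
import io
import numpy as np

# First line is a short row padded with an 'N' marker and two empty fields;
# the second line is a full-width row.
csv_text = "255,128,0,N,,\n10,20,30,40,50,60\n"

pixels = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    missing_values="N",   # treat the 'N' marker as missing
    filling_values=0,     # replace missing or empty fields with 0 (black)
)
```

The result is a float array by default, so you would probably cast it with `pixels.astype(np.uint8)` before feeding it to a model.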

One consideration is that you will need to reshape the data to the same image dimensions after reading, so make sure whatever format you choose keeps the same number of lines.
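For example, if each image is stored as one flat record, the reshape might look like the sketch below. It assumes row-major order with the three channels interleaved (R, G, B, R, G, B, ...), which is my assumption about the layout rather than something the format enforces.

```python
import numpy as np

height, width = 2, 3

# Stand-in for one flat CSV record holding height * width * 3 values.
flat_record = np.arange(height * width * 3, dtype=np.uint8)

# Restore the (height, width, channels) layout of the image.
image = flat_record.reshape(height, width, 3)
```

If the record has more or fewer values than `height * width * 3`, `reshape` raises a `ValueError`, which is a cheap check that the padding was written consistently.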

Also, do you need one big CSV file or lots of smaller ones? If storing lots of files, you could consider adding header data to specify the actual size of the data so you only need to store the image, then skip_header in genfromtxt and pad as needed.
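A minimal sketch of the one-file-per-image idea, assuming a single header line of the form `height,width,channels` followed by one CSV line per image row. The helper names `write_image_csv` and `read_image_csv` are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io
import numpy as np

def write_image_csv(image, stream):
    """Write a size header, then one CSV line per image row."""
    height, width, channels = image.shape
    stream.write(f"{height},{width},{channels}\n")
    for row in image.reshape(height, width * channels):
        stream.write(",".join(str(value) for value in row) + "\n")

def read_image_csv(stream):
    """Read the size header, then load the pixel rows with genfromtxt."""
    height, width, channels = (int(v) for v in stream.readline().split(","))
    stream.seek(0)
    # skip_header=1 makes genfromtxt ignore the size line.
    pixels = np.genfromtxt(stream, delimiter=",", dtype=int, skip_header=1)
    return pixels.astype(np.uint8).reshape(height, width, channels)

original = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
buffer = io.StringIO()
write_image_csv(original, buffer)
buffer.seek(0)
restored = read_image_csv(buffer)
```

Each file then stores only the real pixels, and the padding to a common size can happen after loading.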

Finally, you'd be much better off using a binary format since you have lots of data; consider this, as it will take less space and read/write more quickly.
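One possible binary option, which is my suggestion rather than the only choice, is numpy's own `.npz` archive format, which stores each array with its shape and dtype, so differently sized images need no padding on disk. The sketch below uses an in-memory buffer and made-up array names; with real data you would pass a file path instead.

```python
import io
import numpy as np

# Two stand-in images with different dimensions.
images = {
    "img_0": np.zeros((2, 3, 3), dtype=np.uint8),
    "img_1": np.full((4, 2, 3), 255, dtype=np.uint8),
}

buffer = io.BytesIO()
np.savez_compressed(buffer, **images)  # one named array per image

buffer.seek(0)
with np.load(buffer) as archive:
    restored = {name: archive[name] for name in archive.files}
```

For 300k images you would likely split the data across several archives rather than writing one huge file.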

Thanks for your advice, I'll try. — ATES, Feb 07 '19 at 15:02
