I have approximately 300k .jpg images in my dataset, but they have different dimensions. I want to convert the RGB channels of all the images into a .csv file, but what should I write in the empty cells? I could put an 'N' character there, but I want to organize the .csv file with numpy and a DataFrame. Any ideas? (The dataset is for building a deep learning model.)
-
Are you doing image processing (you can use [Pillow](https://pillow.readthedocs.io/en/stable/) for that) or metadata processing? – Laurent LAPORTE Feb 06 '19 at 12:27
-
I'm not doing image processing; the dataset is for creating a deep learning model. – ATES Feb 06 '19 at 12:30
-
I guess you could resize them all to a common size easily enough... I hope you have lots of disk for this inefficient storage technique! – Mark Setchell Feb 06 '19 at 12:32
-
Resizing to a common size is possible, but in that case much of the data may be lost, so it isn't a solution. – ATES Feb 06 '19 at 12:36
-
Why not put something like np.nan in the empty cells? – Ron U Feb 06 '19 at 12:55
-
Can I analyze and plot the data correctly in this case? – ATES Feb 06 '19 at 12:59
1 Answer
This started as a comment but got too long. I think the answer depends on what you want the code to do when a value is missing.
If a pixel is empty, filling it with white (255,255,255) or black (0,0,0) may be the least invasive option for the deep learning model (you'd need to look into how the model handles this). I found that stretching/scaling the images to a common size was actually the best way to go.
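As a rough illustration of the padding option, here is a minimal sketch that pads an image array with a constant colour up to a common size. It assumes the image is already loaded as an (H, W, 3) uint8 NumPy array; `pad_to_shape` and the sizes used are hypothetical names and values chosen for the example, not anything from the question.

```python
import numpy as np

def pad_to_shape(img, target_h, target_w, fill=0):
    """Pad an (H, W, 3) array on the bottom and right with a constant value.

    fill=0 gives black padding, fill=255 gives white padding.
    """
    h, w = img.shape[:2]
    if h > target_h or w > target_w:
        raise ValueError("image is larger than the target shape")
    pad_widths = ((0, target_h - h), (0, target_w - w), (0, 0))
    return np.pad(img, pad_widths, mode="constant", constant_values=fill)

# A tiny stand-in for a real image: 2 rows, 3 columns, 3 channels.
small = np.full((2, 3, 3), 200, dtype=np.uint8)
padded = pad_to_shape(small, 4, 5, fill=0)
print(padded.shape)
```

Padding keeps the original pixels untouched, at the cost of storing many filler values when the image sizes vary a lot.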
Just writing empty records (an empty string, i.e. nothing between the commas) in the CSV is an option, see this answer. If you're using `numpy.genfromtxt` to read the data, you can then set `missing_values` and `filling_values` as needed. You could also make up a sentinel value that would never occur naturally, such as `99999` or `DEADBEEF`, to mark empty records, and write code to parse these as needed.
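A minimal sketch of the empty-record approach, reading from an in-memory string rather than a real file. The CSV content and the choice of -1 as the fill value are assumptions made for the example; -1 works as a marker here only because pixel values lie in 0..255. As far as I know, `genfromtxt` treats empty fields as missing by default, so only `filling_values` is set.

```python
import io
import numpy as np

# Three rows of unequal "real" length, written with empty trailing fields
# so that every row still has the same number of columns.
csv_text = "10,20,30\n40,,\n70,80,\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    filling_values=-1,  # value substituted for each empty field
)

# Boolean mask of the cells that held real pixel values.
valid = data != -1
print(data)
print(valid.sum())
```

The mask can then be used to ignore the filler cells when analysing the data.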
One consideration is that you will need to reshape the data to the same image dimensions after reading, so make sure whatever format you choose keeps the same number of lines.
Also, do you need one big CSV file or lots of smaller ones? If storing lots of files, you could add header data specifying the actual size of each image, so that you only need to store the image itself, then use `skip_header` in `genfromtxt` and pad as needed.
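The one-file-per-image idea could look something like the sketch below. The layout (a first line holding height, width and channel count, followed by one CSV row per image row) is an assumption made for this example, and `write_image_csv` / `read_image_csv` are hypothetical helper names. An in-memory string stands in for the file.

```python
import io
import numpy as np

def write_image_csv(img):
    """Return CSV text: a 'h,w,c' header line, then one line per image row."""
    h, w, c = img.shape
    buf = io.StringIO()
    buf.write(f"{h},{w},{c}\n")
    np.savetxt(buf, img.reshape(h, w * c), fmt="%d", delimiter=",")
    return buf.getvalue()

def read_image_csv(text, target_h, target_w, fill=0):
    """Read the header, load the pixels, and pad to (target_h, target_w)."""
    first_line = text.split("\n", 1)[0]
    h, w, c = (int(v) for v in first_line.split(","))
    flat = np.genfromtxt(io.StringIO(text), delimiter=",",
                         skip_header=1, dtype=int)
    img = flat.reshape(h, w, c).astype(np.uint8)
    pad_widths = ((0, target_h - h), (0, target_w - w), (0, 0))
    return np.pad(img, pad_widths, mode="constant", constant_values=fill)

original = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
text = write_image_csv(original)
restored = read_image_csv(text, target_h=4, target_w=5)
print(restored.shape)
```

Only the real pixels are stored on disk; the padding is applied at load time, so the target size can be changed later without rewriting the files.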
Finally, with this much data you'd be much better off using a binary format; consider this, as it will take less space and read/write more quickly.
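To give a feel for the difference, here is a small sketch comparing the size of one image stored as CSV text with the same image stored in NumPy's binary `.npy` format. Using `.npy` is my assumption for the example (other binary formats would also do), and the random 64x64 image is made up. In-memory buffers stand in for files.

```python
import io
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Text: one CSV line per image row.
csv_buf = io.StringIO()
np.savetxt(csv_buf, img.reshape(64, -1), fmt="%d", delimiter=",")
csv_size = len(csv_buf.getvalue())

# Binary: the .npy format stores one byte per uint8 value plus a short header.
npy_buf = io.BytesIO()
np.save(npy_buf, img)
npy_size = len(npy_buf.getvalue())

# Round trip: shape and dtype come back without any reshaping code.
npy_buf.seek(0)
loaded = np.load(npy_buf)

print(csv_size, npy_size)
```

The binary file also records the shape and dtype, so images of different dimensions can each be saved as-is and padded or resized only when loaded.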
