
I'm trying to write an algorithm that will save each filename and the 3-channel np.array stored in that file to a csv (or similar filetype), and then be able to read the csv back in and reproduce the color image.

The format of my csv should look like this:

  Filename RGB
0 foo.png  np.array      # the shape is 100*100*3
1 bar.png  np.array
2 ...      ...

As it stands, I'm iterating through each file saved in a directory and appending a list that later gets stored in a pandas.DataFrame:

import os

import cv2
import pandas

directory = r'C:/my Directory'
fileList = os.listdir(directory)
filenameList = []
RGBList = []
for eachFile in fileList:
    filenameList.append(eachFile)
    # Read as a 3-channel image and flatten to raw bytes
    RGBList.append(cv2.imread(os.path.join(directory, eachFile), 1).tostring())
df1 = pandas.DataFrame()
df2 = pandas.DataFrame()
df1["Filenames"] = filenameList
df2["RGB"] = RGBList
df1.to_csv('df1.csv')
df2.to_csv('df2.csv')

df1 functions as desired. I THINK df2 functions as intended: a print statement reveals the correct len of 30,000 for each row before writing the csv. However, when I read the csv back in using pandas.read_csv('df2.csv') and print the len of the first row, I get 110541. I intend to use np.fromstring() and np.reshape() to rebuild the flattened np.array produced by tostring(), but I get the error:

ValueError: string size must be a multiple of element size

...because the number of elements is mismatched.
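For reference, this is a sketch of the in-memory round-trip being attempted, using a random array as a hypothetical stand-in for a real image. It works only while the string still holds exactly 30,000 raw bytes, and only if the dtype is given explicitly (np.fromstring defaults to float64, whose 8-byte element size will not divide the 110,541-character string read back from the csv, hence the ValueError):

```python
import numpy as np

# Hypothetical stand-in for cv2.imread() output: a 100x100x3 8-bit image
img = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)

flat = img.tostring()  # 30,000 raw bytes (tostring is an alias of tobytes)

# Rebuild - the dtype must match the original uint8 data, since the
# default float64 checks the byte count against an 8-byte element size
restored = np.frombuffer(flat, dtype=np.uint8).reshape(100, 100, 3)

print(np.array_equal(img, restored))  # True
```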

My question is:

  1. Why is the len so much larger when I read in the csv?
  2. Is there a more efficient way to write 3 channel color image pixel data to a csv that can easily be read back in?
  • Not sure I understand your question, but if you write a single byte for each 8-bit pixel you will get a line with 1 byte per pixel. If you write `186,` for a pixel in ASCII in a `CSV` you will get 4 bytes per pixel - 1 byte for `1`, 1 byte for `8`, 1 byte for `6` and 1 byte for the comma. That means your file will be around 4x bigger, i.e. 110k instead of 30k. – Mark Setchell Feb 07 '18 at 08:33
  • @MarkSetchell I think this is exactly what is happening. Is there a better way to write the data to the csv to avoid this problem? Or is there some keyword argument I'm missing in the read_csv statement? – Drew Wilkins Feb 07 '18 at 18:55
  • There is no better way to write a CSV - the problem is that it is a fundamentally inefficient format designed for humans rather than computers. Why did you choose CSV? If it **has** to be legible for humans, you have no choice. If it can be illegible to humans, but readily legible to computers, choose a different format. Answer the question above please, and I'll come back to you. – Mark Setchell Feb 07 '18 at 19:06
  • @MarkSetchell It does not have to be CSV. If there is a more efficient way to store the 100x100x3 np.array and read it in while preserving the shape, that would completely answer the question. – Drew Wilkins Feb 07 '18 at 19:30
  • I'm more of an "image" person than a "Python" person, so I think I'd better refer you to this... https://stackoverflow.com/a/28440249/2836621 – Mark Setchell Feb 07 '18 at 19:37
  • @MarkSetchell I think your second comment answered the question. Instead of writing a CSV, I saved the data as an .npy using `np.save()` and loaded it in with `np.load()` which preserved the original shape. I then saw your last post directing me to https://stackoverflow.com/a/28440249/2836621, but I had already solved the problem by then. Thank you for help. Post as an answer and I'll select it. – Drew Wilkins Feb 07 '18 at 19:51
  • It's getting late here - I'll write it up in the morning. Glad you are back up and running. – Mark Setchell Feb 07 '18 at 20:04
  • I have written it up as an answer for all to see. Glad you are running ok now - good luck with your project! – Mark Setchell Feb 08 '18 at 13:27

1 Answer


If you write a single byte for each 8-bit pixel you will get a line with 1 byte per pixel. So, if your image is 80 pixels wide, you will get 80 bytes per line.

If you write a CSV, in human-readable ASCII, you will need more space. Imagine the first pixel is 186. So, you will write a 1, an 8, a 6 and a comma - i.e. 4 bytes now for the first pixel instead of a single byte in binary, and so on.

That means your file will be around 3-4x bigger, i.e. 110k instead of 30k, which is what you are seeing.
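A quick sketch of that arithmetic with a throwaway random array (no real image needed):

```python
import numpy as np

# A random 100x100x3 8-bit image, like the ones in the question
img = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)

raw = img.tobytes()                                 # binary: 1 byte per channel value
ascii_csv = ",".join(str(v) for v in img.ravel())   # CSV-style ASCII text

print(len(raw))        # 30000
print(len(ascii_csv))  # roughly 107,000 - about 3.5x the binary size
```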


There is no "better way" to write a CSV - the problem is that it is a fundamentally inefficient format designed for humans rather than computers. Why did you choose CSV? If it has to be legible for humans, you have no choice.

If it can be illegible to humans, but readily legible to computers, choose a different format such as np.save() and np.load() - as you wisely have done already ;-)
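A minimal sketch of that round-trip, using a random array and a hypothetical filename in place of a real image:

```python
import numpy as np

# Hypothetical 100x100x3 image standing in for cv2.imread() output
img = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)

np.save('img.npy', img)        # shape and dtype go into the .npy header
restored = np.load('img.npy')  # comes back as 100x100x3 - no reshape needed

print(restored.shape)  # (100, 100, 3)
```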

Mark Setchell