Looping images in, saving ID name and storing it correspondingly in dataframe

Question

Hellos,

Introduction:

I'm trying to set up a pandas DataFrame that connects a number of discrete chemical values to a number of images. It's slightly above my current level, so I was hoping for some help here.

What I've got so far:

I've currently sliced out two columns from a provided datasheet that looks like this.

I have 1688 data points here and 1225 images of size 10x10x4 (RGBA) that are to be associated with them. As an array, the images have shape (1225, 10, 10, 4) and dtype uint8.

Each image is named after its Sample_ID, as seen in column 1. My goal is to run a loop that picks up the images from the folder, flattens and reshapes them into 300x1, and then stores them in a third column that is checked against the Sample_ID. This means that the correct image must correspond to the correct Sample_ID.

I've scoured the net and Stack Overflow. I've already tried four different image-loading loops from here, none of which gave me quite the result I expected.

My best bet so far seems to have been using glob to load everything into a NumPy file. But I surely need a looping function that links each image with the corresponding ID and Ni value.

Any suggestions on how I can load each image and store its ID value for cross-referencing with the existing dataframe?

Thank you for your time.

Yes like 323727.png for example. Sorry about being unclear on this. — Mars, Sep 24 '18 at 19:30

hellpanderr · Accepted Answer · 2018-09-24T20:30:45.623

Assuming the image id is in its file name, and using matplotlib.image.imread:

import os

path = '.'  # current directory
filenames = [os.path.abspath(os.path.join(path, x)) for x in os.listdir(path) if x.lower().endswith(('.png', '.jpg'))]

>>> [os.path.basename(f) for f in filenames]
['image_0.png',
 'image_1.png',
 'image_2.png',
 'image_3.png',
 'image_4.png',
 'image_5.png',
 'image_6.png',
 'image_7.png',
 'image_8.png',
 'image_9.png']

Read images into a dataframe and add their names as a column:

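import pandas as pd
from matplotlib.image import imread

# One row per image; the flattened pixel array goes in the 'images' column
images_df = pd.DataFrame([[imread(filename).flatten()] for filename in filenames], columns=['images'])
images_df['id'] = filenames
images_df['id'] = images_df['id'].apply(os.path.basename)  # strip the directory part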
>>> images_df

                                              images           id
0  [0.4627451, 0.05490196, 0.8745098, 0.79607844,...  image_0.png
1  [0.20784314, 0.93333334, 0.73333335, 0.6156863...  image_1.png
2  [0.4117647, 0.3254902, 0.8784314, 0.16470589, ...  image_2.png
3  [0.8627451, 0.6862745, 0.78431374, 0.6431373, ...  image_3.png
4  [0.44705883, 0.627451, 0.57254905, 0.78431374,...  image_4.png
5  [0.7490196, 0.007843138, 0.25490198, 0.1372549...  image_5.png
6  [0.039215688, 0.14901961, 0.5882353, 0.5137255...  image_6.png
7  [0.24705882, 0.94509804, 0.1882353, 0.38039216...  image_7.png
8  [0.35686275, 0.047058824, 0.56078434, 0.062745...  image_8.png
9  [0.8, 0.23921569, 0.99607843, 0.89411765, 0.23...  image_9.png
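A side note on the flattened length: imread returns PNG data as floats between 0 and 1, which is why the values above are not uint8. A 10x10x4 RGBA image flattens to 400 values; dropping the alpha channel gives the 300 values mentioned in the question. A minimal sketch, using a synthetic array as a stand-in for one image (the array and variable names are made up for illustration):

```python
import numpy as np

# Synthetic stand-in for one 10x10 RGBA image (assumption: not a real sample)
rgba = np.zeros((10, 10, 4), dtype=np.uint8)

flat_rgba = rgba.flatten()          # all four channels: 10 * 10 * 4 values
flat_rgb = rgba[..., :3].flatten()  # alpha dropped: 10 * 10 * 3 values

print(flat_rgba.shape, flat_rgb.shape)
```

If the alpha channel is not needed, the same slice could be applied to the result of imread before flattening.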

Extract the id from the file names:

images_df['id'] = images_df['id'].str.split('.').str[0]
>>> images_df['id']
0    image_0
1    image_1
2    image_2
3    image_3
4    image_4
5    image_5
6    image_6
7    image_7
8    image_8
9    image_9
Name: id, dtype: object
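Splitting on '.' keeps only the text before the first dot, which is fine for names like the ones above. If a file name could contain extra dots, os.path.splitext removes only the final extension. A small sketch with made-up file names ('sample.v2.png' is purely hypothetical):

```python
import os
import pandas as pd

# Hypothetical file names, including one with an extra dot in the stem
names = pd.Series(['323727.png', 'Midt-1.png', '92-849.jpg', 'sample.v2.png'])

# splitext returns (stem, extension); keep the stem as the id
ids = names.apply(lambda name: os.path.splitext(name)[0])

print(ids.tolist())
```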

images_df['id'] must have the same dtype as Sample_ID for the merge to match, so convert it to an integer if Sample_ID is an integer column.

Join dataframes:

pd.merge(images_df, new_data_rdy, left_on='id', right_on='Sample_ID')
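By default pd.merge performs an inner join, so rows without a matching key are silently dropped. To see which images found no partner, a left join with indicator=True can help. The sketch below uses tiny made-up frames (the ids and Ni values are invented for illustration, not taken from the real data):

```python
import pandas as pd

# Made-up stand-ins for the two dataframes
images_df = pd.DataFrame({'id': ['323727', 'Midt-1', '92-849']})
new_data_rdy = pd.DataFrame({'Sample_ID': ['323727', 'Midt-1 ', '114605'],
                             'Ni': [1.0, 2.0, 3.0]})

# Keep every image row and record where each row came from in '_merge'
checked = pd.merge(images_df, new_data_rdy, how='left',
                   left_on='id', right_on='Sample_ID', indicator=True)

# Rows marked 'left_only' are images with no matching Sample_ID
unmatched = checked.loc[checked['_merge'] == 'left_only', 'id'].tolist()
print(unmatched)
```

Here 'Midt-1' fails to match only because of a trailing space in Sample_ID, which is the kind of mismatch this check is meant to surface.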

edited Sep 24 '18 at 20:30

answered Sep 24 '18 at 18:52

hellpanderr


Well, I had just gone to bed; however, upon seeing your message I rolled out, got back up, and I'm SSHing in now. This looks really good! Thank you, I'll get right back to you in a few. – Mars Sep 24 '18 at 19:31
This is rather strange. I get this error halfway through your code. FileNotFoundError: [Errno 2] No such file or directory: '114605.png' Which is rather odd. I'm looking at a healthy png file in that folder by that name. – Mars Sep 24 '18 at 19:53
@Kongie I fixed filepaths, could you try it again? – hellpanderr Sep 24 '18 at 20:02
Changing the directory made it work. There is however one last thing. When I run the last command, pd.merge. They merge correctly, however out of 1225, I only get a dataframe of 44 merged entries. – Mars Sep 24 '18 at 20:14
Do you mean you want the left join or you are not getting enough matches? – hellpanderr Sep 24 '18 at 20:17
I tried adding your fixed filepath. Thank you. However it still requires me to be in the folder of the images strangely enough – Mars Sep 24 '18 at 20:17
I meant that it should have 1225 matches, but it matches only 44. After your filepath change, it doesn't even find 1 now. I'm just changing it back. images_df has 1225 entries. new_data_rdy has 1688 currently. Out of that come only 44 matches. :I – Mars Sep 24 '18 at 20:19
@Kongie Paths should work now, could you check the types of ID columns on both dataframes and show how do they look? – hellpanderr Sep 24 '18 at 20:32
Path does work now @hellpanderr. However! Heh, it writes the entirety of the path all the way from the home folder. I fixed that now by adding another line to your split code. images_df['id'] = images_df['id'].str.split('S2/').str[1]. But back to the problem of it only merging 44 in total. I can't post pictures in the comment section. But it seems that the merge only allows for 44 matches. The original data consist of both numbers and letters unfortunately. They are real world data taken over 20 years. So some of the samples are named like "Midt-1", some 92-849 and some 3230494. – Mars Sep 24 '18 at 20:38
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180692/discussion-between-hellpanderr-and-kongie). – hellpanderr Sep 24 '18 at 20:43
As we found out in chat, in case it may be useful to others: both id fields had to be set to .astype(str). As a precaution, we also stripped the Sample_ID of empty spaces. This allowed for a successful merge! – Mars Sep 24 '18 at 21:03
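The fix described in the last comment can be sketched as follows. This is only an illustration with made-up frames (the ids and Ni values are invented), showing both key columns cast to str and stripped of whitespace before merging:

```python
import pandas as pd

# Made-up stand-ins: Sample_ID mixes integers, padded strings and plain strings
images_df = pd.DataFrame({'id': ['323727', 'Midt-1', '92-849']})
new_data_rdy = pd.DataFrame({'Sample_ID': [323727, ' Midt-1 ', '92-849'],
                             'Ni': [1.0, 2.0, 3.0]})

# Normalise both key columns: same type, no surrounding whitespace
images_df['id'] = images_df['id'].astype(str).str.strip()
new_data_rdy['Sample_ID'] = new_data_rdy['Sample_ID'].astype(str).str.strip()

merged = pd.merge(images_df, new_data_rdy, left_on='id', right_on='Sample_ID')
print(len(merged))
```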
