How to retrieve images from URLs in a pandas dataframe and store them as PIL objects in a new column

Question

I have a dataframe with a column of image URLs. I'm trying to download each image and store it as a PIL object in a new column of the same dataframe.

I've tried the following code:

import pandas as pd
from PIL import Image
import requests
from io import BytesIO

pictures = [None] * 2

df = pd.DataFrame({'project_id':["1", "2"], 
                    'image_url':['http://www.personal.psu.edu/dqc5255/gl-29.jpg',
                                'https://www.iprotego.com/wp-content/uploads/google.jpg']})

# Previously the second link was broken and led to an error; I edited it and it now works fine

df.insert(2, "pictures", pictures, True)

for i in range(2):
    r = requests.get(df.iloc[i,1]) 
    df.iloc[i,2] = Image.open(BytesIO(r.content))

df

I expected to get a dataframe with this format but including both training examples:

    project_id                  image_url                                  pictures
0       1    http://www.personal.psu.edu/dqc5255/gl-29.jpg <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=400x300 at 0x116EF9AC8>

But instead got the following error:

OSError: cannot identify image file <_io.BytesIO object at 0x116ec2f10>

Take a look: https://stackoverflow.com/questions/31077366/pil-cannot-identify-image-file-for-io-bytesio-object – rafaelc Aug 28 '19 at 21:59
Thanks rafaelc, I took a look at the suggested post but could not solve the error. What is weird is that if I change the range of the for loop to 1, it works for the first training example, but if I keep it at 2 it gives me the error mentioned above. – Jo_Gisbert Aug 28 '19 at 22:33

Spaceship222 · Accepted Answer · 2019-08-29T01:04:30.420

According to my test, if you request the second URL with the default headers, access to the content is forbidden (I guess the server thinks you are a web spider because of the default "User-Agent" value, "python-requests/2.xx.x"). So just change the User-Agent header to a browser-like value (e.g. Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0), as if the request came from your browser.
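
A minimal sketch of that suggestion, under some assumptions: `HEADERS` and `fetch_image` are hypothetical names, the 10-second timeout is an arbitrary choice, and the User-Agent string is simply the example value quoted above. Whether a given server accepts it is not guaranteed.

```python
from io import BytesIO

import requests
from PIL import Image

# Example User-Agent from the answer; any browser-like value should behave similarly (assumption).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0"
}


def fetch_image(url, headers=HEADERS, timeout=10):
    """Hypothetical helper: download url with browser-like headers and open it with PIL."""
    r = requests.get(url, headers=headers, timeout=timeout)
    # Image.open raises OSError if the body is not image data (e.g. an HTML error page).
    return Image.open(BytesIO(r.content))
```

In the loop from the question, the two lines in the body could then become `df.iloc[i, 2] = fetch_image(df.iloc[i, 1])`.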

edited Aug 29 '19 at 01:04

answered Aug 29 '19 at 00:44

Thanks for your help @Spaceship222, I just tried changing the `User-agent` in the for loop so that now the request line in the loop is: `r = requests.get(df.iloc[i,1], headers=headers)` with `headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/XXX.XX (KHTML, like Gecko) Chrome/XX.X.XXXX.XXX Safari/XXX.XX'}` but the same error occurs: `OSError: cannot identify image file <_io.BytesIO object at 0x1105bfd00>` @rafaelc – Jo_Gisbert Aug 29 '19 at 08:54
I guess your error comes from a wrong URL, since the second URL should be `'https://www.iprotego.com/wp-content/uploads/google.jpg'`: there is no space between 'wp-' and 'content'. It would be better to check the response status with `r.raise_for_status()` before using `r.content` – Spaceship222 Aug 29 '19 at 09:04
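
The status check suggested in the last comment could look roughly like the sketch below. The helper names `load_image_or_none` and `add_pictures`, the timeout, and the choice to store `None` for a failed row are all assumptions for illustration, not part of the original code.

```python
from io import BytesIO

import pandas as pd
import requests
from PIL import Image


def load_image_or_none(url, timeout=10):
    """Hypothetical helper: return a PIL image, or None if the download or decode fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
        return Image.open(BytesIO(r.content))
    except (requests.RequestException, OSError) as exc:
        # A bad URL, a forbidden request, or a non-image body ends up here.
        print(f"skipping {url}: {exc}")
        return None


def add_pictures(df, url_column="image_url"):
    """Return a copy of df with a 'pictures' column holding one PIL image (or None) per row."""
    out = df.copy()
    out["pictures"] = out[url_column].apply(load_image_or_none)
    return out
```

With the dataframe from the question, `add_pictures(df)` would replace both the `df.insert(...)` call and the loop, and a broken link would show up as a printed message and a missing value instead of an `OSError` that stops the whole run.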
