I am trying to write a program that returns the image link for the first image of a Google search.

The link I am trying to get is the one you would reach by clicking the first image, right-clicking the image that appears, and then opening the image. The current code I have is:

import requests
from bs4 import BeautifulSoup

r = requests.get(theurl)
soup = BeautifulSoup(r.text, "lxml")
link = soup.find('img', class_='irc_mi')['src']
return link

However, I get an error: "TypeError: 'NoneType' object is not subscriptable".

Deduplicator
  • Could you give us an example of a page you wish to get the image from? Just a typical Google image search? – Sean Breckenridge Apr 21 '18 at 05:27
  • If you were to search up python using the program it would give you this [link](https://www.python.org/static/opengraph-icon-200x200.png). This link is gotten from searching python on google images, clicking the first result, and then right clicking and opening the image in a new tab. – Naveen Manoharan Apr 21 '18 at 05:42
  • I don't think the links for those are in the page when it first loads, i.e. if you're trying to get the 'large version', you'd have to actually let the page load, click on it, and then pull the source. This is still possible using something like selenium. – Sean Breckenridge Apr 21 '18 at 06:20
  • Please show an example of the block whose content you want to retrieve, so that you can get the answer you are looking for. – Rachit kapadia Apr 21 '18 at 08:51

3 Answers


It appears that the src attributes are added by JavaScript running in the browser. You can use Requests-HTML to render the page and achieve your goal:

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.google.pl/search?q=python&source=lnms&tbm=isch&sa=X&ved=0ahUKEwif6Zq7i8vaAhVMLVAKHUDkDa4Q_AUICigB&biw=1280&bih=681'
r = session.get(url)
r.html.render()  # executes the page's JavaScript in headless Chromium

first_image = r.html.find('.rg_ic.rg_i', first=True)
link = first_image.attrs['src']
radzak
  • Does that download every link from the webpage? After running it, it seems to be downloading something. If it downloads all the links, that seems like it would take a long time for the user's image search to process. – Naveen Manoharan Apr 21 '18 at 20:24
  • [RequestsHTML](https://html.python-requests.org/) has a [JavaScript support](https://html.python-requests.org/#javascript-support) that uses Chromium behind the scenes. I believe that's the reason why it takes a while for the script to finish. – radzak Apr 22 '18 at 12:18

You can achieve this using selenium, but the execution time will be slower than with bs4.

To scrape the original image links using bs4, you need to search the <script> tags with a regex and then decode the matched links.

For example, part of the code (check out full example in the online IDE):

# find all script tags
all_script_tags = soup.select('script')

# find all full-resolution image links (re.findall needs a string,
# so the list of tags is converted with str())
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   str(all_script_tags))

# iterate over found matches and decode them
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
    print(original_size_img)

-----
'''
https://external-preview.redd.it/mAQWN2kUYgFS3fgm6LfYo37AY7i2e_YY8d83_1jTeys.jpg?auto=webp&s=b2bad0e23cbd83426b06e6a547ef32ebbc08e2d2
https://i.ytimg.com/vi/_mR0JBLXRLY/maxresdefault.jpg
https://wallpaperaccess.com/full/37454.jpg
...
'''

Alternatively, you can achieve this easily by using the Google Images API from SerpApi. It's a paid API with a free plan.

The difference is that you don't need to figure out how to scrape anything or maintain the parser if something changes over time. All that needs to be done is to iterate over the structured JSON and extract the needed data.

Code to integrate:

import os, json
from serpapi import GoogleSearch

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "minecraft shaders 8k photo",
  "tbm": "isch"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

------
'''
[
...
  {
    "position": 30,
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ_CjA8J1P5Y6bN2KCuY6XgS4mFvctuwhho6A&usqp=CAU",
    "source": "wallpaperbetter.com",
    "title": "minecraft shaders video games, HD wallpaper | Wallpaperbetter",
    "link": "https://www.wallpaperbetter.com/en/hd-wallpaper-cusnk",
    "original": "https://p4.wallpaperbetter.com/wallpaper/120/342/446/minecraft-shaders-video-games-wallpaper-preview.jpg",
    "is_product": false
  }
...
]
'''
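Once you have that dict, pulling out just the original full-resolution links is a plain list comprehension. A small sketch, using a mocked results dict standing in for a live search.get_dict() response:

```python
# mocked stand-in for the dict returned by search.get_dict()
results = {
    "images_results": [
        {"position": 1, "original": "https://wallpaperaccess.com/full/37454.jpg"},
        {"position": 2, "original": "https://i.ytimg.com/vi/_mR0JBLXRLY/maxresdefault.jpg"},
    ]
}

# each entry's "original" field holds the full-resolution image link
original_links = [img["original"] for img in results["images_results"]]
print(original_links[0])  # https://wallpaperaccess.com/full/37454.jpg
```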

I have already answered a similar question here and wrote a dedicated blog post about how to scrape and download Google Images with Python.

Disclaimer, I work for SerpApi.

Dmitriy Zub

You have a typo: the keyword argument is class_, not class.

Also, you don't actually need the class_ keyword at all; find() accepts the class name as a positional argument:

import requests
from bs4 import BeautifulSoup

r = requests.get(theurl)
soup = BeautifulSoup(r.text, "lxml")
link = soup.find("img", "irc_mi")["src"]
return link
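Note that the original TypeError means find() returned None, i.e. no matching <img> tag exists in the HTML the server sends back (as the other answers note, Google adds it with JavaScript). A defensive version of the lookup, sketched here with the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

def first_image_src(html):
    """Return the src of the first <img class="irc_mi">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib parser; lxml also works
    img = soup.find("img", "irc_mi")
    # guard: subscripting None is what raised the original TypeError
    return img["src"] if img is not None else None

print(first_image_src('<img class="irc_mi" src="http://example.com/a.png">'))
print(first_image_src('<p>no image markup here</p>'))  # None instead of a crash
```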
Fraser