How to webscrape images from Google News?

Question

I'm still fairly new to web scraping. I'm trying to scrape the article images from a Google News page and display them in my Django template, following the tutorial from Towards Data Science that can be found here. Right now I'm just trying to get the img HTML from each article div to check that I can pull the data. The format should look like this: <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/> However, at the moment my code returns an empty list, which tells me that I am not targeting the image correctly. Any advice from those who are more experienced would be welcome.

from django.shortcuts import render, HttpResponse, redirect
from django.contrib import messages
from .models import *
from django.db.models import Count
import requests, urllib.parse
from bs4 import BeautifulSoup

def index(request):
    URL = 'https://www.google.com/search?q=beyond+meat&rlz=1C1CHBF_enUS898US898&sxsrf=ALeKk00IH9jp1Kz5-LSyi7FUB4rd6--_hw:1624935518812&source=lnms&tbm=nws&sa=X&ved=2ahUKEwicqIbD7LvxAhVWo54KHXgRA9oQ_AUoAXoECAEQAw&biw=1536&bih=754'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    headers = soup.find_all('div', class_='BNeawe vvjwJb AP7Wnd')
    header_dict = []
    for h in headers:
        header_dict.append(h.text)
    image = soup.find_all('div', class_="qV9w7d")
    context= {
        "header_dict": header_dict,
        "example": image,
    }
    return render(request, 'index.html', context)

What are you passing to the `request` argument in your `index` function? — MendelG, Jul 15 '21 at 23:21

MendelG · Accepted Answer · 2021-07-15T23:35:45.783

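If you disable JavaScript on the website, you'll see that the image class name changes from qV9w7d to EYOsld. The requests module doesn't execute JavaScript, so the HTML it receives is the no-JavaScript version of the page, and the class you need is on the img tag rather than on a div.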

So, instead of:

image = soup.find_all('div', class_="qV9w7d")

Use:

image = soup.find_all('img', class_="EYOsld")
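As a minimal sketch of what this selector returns, the snippet below parses a small hand-written HTML fragment and collects the src of each matching img tag. Only the class name EYOsld comes from the answer; the sample markup (a hypothetical stand-in for the no-JavaScript page, not a real Google response) and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests receives (no JavaScript).
SAMPLE_HTML = """
<div><img class="EYOsld" alt="" src="data:image/gif;base64,R0lGODlhAQABAAAAACw="></div>
<div><img class="other" alt="" src="/logo.png"></div>
<div><img class="EYOsld" alt="" src="data:image/gif;base64,R0lGODlhAQABAAAAACw="></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match <img> tags by class; find_all returns a list-like ResultSet.
images = soup.find_all("img", class_="EYOsld")

# Keep only the src attribute so the template context holds plain strings.
image_sources = [img.get("src") for img in images]
print(len(image_sources))
```

Passing plain src strings in the context, rather than the tag objects themselves, tends to be easier to work with in a Django template.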

Note that the src attributes of these images are base64-encoded. Since there's already an in-depth post on that, I'll include the link here.
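For illustration, a base64 src of this kind is a data URI of the form data:<mime type>;base64,<payload>. The sketch below splits such a string and decodes the payload with the standard library. The helper name decode_data_uri and the example payload are hypothetical, and the function handles only the simple base64 form.

```python
import base64


def decode_data_uri(src):
    """Split a data URI like 'data:image/png;base64,AAAA' into (mime_type, raw_bytes)."""
    if not src.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything before the first comma is the header, the rest is the payload.
    header, _, payload = src.partition(",")
    meta = header[len("data:"):]  # e.g. 'image/png;base64'
    mime_type, _, encoding = meta.partition(";")
    if encoding != "base64":
        raise ValueError("only base64-encoded data URIs are handled here")
    return mime_type, base64.b64decode(payload)


# Made-up payload for illustration: the bytes b"hello", not a real image.
example_src = "data:image/png;base64," + base64.b64encode(b"hello").decode("ascii")
mime_type, raw = decode_data_uri(example_src)
print(mime_type, raw)
```

If the goal is only to display the image, the data URI string can be used directly as the src of an img tag in the template; decoding is needed only to save or process the bytes.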

edited Jul 15 '21 at 23:35

answered Jul 15 '21 at 23:25

Awesome, thanks a lot. That may actually explain some issues I was having with other projects - the JavaScript just needed to be disabled for me to grab the correct class-name. – melandalin Jul 16 '21 at 02:23

@melandalin Class names in Google's HTML are generated dynamically. You will notice that they change from day to day! – αԋɱҽԃ αмєяιcαη Jul 16 '21 at 08:15
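Because the class names may change, one possible workaround is to select images by a property of the tag instead of by class. The sketch below keeps only img tags whose src is an inline data URI. The sample markup, the class names in it, and the helper is_inline_image are all hypothetical, and whether this filter suits the real page is an assumption to verify against the HTML you actually receive.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names here are placeholders, not Google's.
SAMPLE_HTML = """
<img class="abc123" src="data:image/jpeg;base64,AAAA">
<img class="xyz789" src="/static/logo.png">
<img class="abc123">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def is_inline_image(src):
    # BeautifulSoup passes None when the tag has no src attribute.
    return src is not None and src.startswith("data:image")


# Filter on the src attribute value rather than on a class name.
inline_images = soup.find_all("img", src=is_inline_image)
print(len(inline_images))
```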
