I'm trying to write a script that scrapes 9gag for images, and images only. The problem is that either requests or BeautifulSoup is getting the wrong HTML page: BeautifulSoup currently receives the page source, not the page that contains the images.
Why is BeautifulSoup excluding the classes that contain the actual images? Or are these different HTML pages?

I have tried different parsers for BeautifulSoup, but I still get the wrong page.

If you go to 9gag, right-click, and choose "Inspect", you can see the images and the page that the script should extract them from.

My script:

import os

import requests
from bs4 import BeautifulSoup


def download_image(url, fileName):
    # Save the image at `url` to imgs/<fileName>.
    os.makedirs("imgs", exist_ok=True)
    path = os.path.join("imgs", fileName)
    with open(path, "wb") as f:
        f.write(requests.get(url).content)


def fetch_url(url):
    # Fetch the page.
    return requests.get(url)


def parse_html(htmlPage):
    # Parse the downloaded HTML.
    return BeautifulSoup(htmlPage, "html.parser")


def retrieve_jpg_urls(soup):
    # Collect the src of every <img> tag present in the parsed HTML.
    return [img["src"] for img in soup.find_all("img") if img.get("src")]


def main():
    htmlPage = fetch_url("https://9gag.com/")
    soup = parse_html(htmlPage.content)
    jpgUrls = retrieve_jpg_urls(soup)
    for index, url in enumerate(jpgUrls):
        try:
            download_image(url, "savedpic{}.jpg".format(index))
        except (requests.RequestException, OSError):
            print("failed to download image with url {}".format(url))


if __name__ == "__main__":
    main()

What BeautifulSoup is getting:

<!DOCTYPE html>

<html lang="en">
<head>
<title>9GAG: Go Fun The World</title>
<link href="https://assets-9gag-fun.9cache.com" rel="preconnect"/>
<link href="https://img-9gag-fun.9cache.com" rel="preconnect"/>
<link href="https://miscmedia-9gag-fun.9cache.com" rel="preconnect"/>
<link href="https://images-cdn.9gag.com/img/9gag-og.png" rel="image_src"/>
<link href="https://9gag.com/" rel="canonical"/>
<link href="android-app://com.ninegag.android.app/http/9gag.com/" rel="alternate"/>
<link href="https://assets-9gag-fun.9cache.com/s/fab0aa49/5aa8c9f45ee3dd77f0fdbe4812f1afcf5913a34e/static/dist/core/img/favicon.ico" rel="shortcut icon"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="9GAG has the best funny pics, gifs, videos, gaming, anime, manga, movie, tv, cosplay, sport, food, memes, cute, fail, wtf photos on the internet!" name="description"/> 

I want the following:

<img src="https://img-9gag-fun.9cache.com/photo/aLgyG2V_460s.jpg" alt="There&amp;#039;s genuine friend love there" style="min-height: 566.304px;">
Nazim Kerimbekov
snaz
  • Does 9gag load the images with JavaScript? If they do, you'll have to take another approach, since requests does not execute JavaScript. – That1Guy Jul 12 '19 at 14:21
  • Yes, I think it might... the images are embedded in a class that loads with JavaScript. – snaz Jul 12 '19 at 14:28
  • You can't get the image you want just by parsing the HTML, because the images are loaded with JS. But you can use the `re` module to extract the JSON present on the page. Search for `window._config = JSON.parse(` in the HTML. – abdusco Jul 12 '19 at 14:29
  • Use the `requests_html` library; it will render JavaScript. – Pyd Jul 12 '19 at 14:30
  • Check out [Selenium](https://selenium-python.readthedocs.io/) or [dryscrape](https://github.com/niklasb/dryscrape) for scraping with JS support. Also see [this answer](https://stackoverflow.com/a/26440563/1555990) for additional help with examples. Note that Selenium requires a display. To run it headless, see my answer [here](https://stackoverflow.com/a/13055412/1555990) – That1Guy Jul 12 '19 at 14:35
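
The point made in the comments above can be checked offline. What follows is a minimal sketch, not 9gag's actual markup: `SAMPLE_HTML` is a made-up page in which the image URL appears only inside a `<script>` block, and `ImgCollector` is a hypothetical helper built on the standard library's `html.parser` (the parser named in the question's `BeautifulSoup(..., "html.parser")` call).

```python
from html.parser import HTMLParser

# Made-up stand-in for a page whose images are inserted by JavaScript.
SAMPLE_HTML = """
<html><head><title>demo</title></head>
<body>
<div id="feed"></div>
<script>window._config = {"img": "https://example.com/a.jpg"};</script>
</body></html>
"""


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


collector = ImgCollector()
collector.feed(SAMPLE_HTML)

# No <img> tag exists until a browser runs the script, so nothing is found,
# even though the URL is present in the raw text of the page.
print(collector.sources)
print("https://example.com/a.jpg" in SAMPLE_HTML)
```

An HTML parser only sees the tags that are in the downloaded text, which is why searching for `img` finds nothing while a plain text search still finds the URL.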

1 Answer

Try extracting the JSON on the page:

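import json
import re

import requests

# ...
res = requests.get(...)
html = res.text  # decoded str; res.content is bytes and makes re.search raise TypeError

# The page embeds its data as window._config = JSON.parse("...");
m = re.search(r'JSON\.parse\((.*)\);</script>', html)
double_encoded = m.group(1)
encoded = json.loads(double_encoded)  # JS string literal -> JSON text
parsed = json.loads(encoded)          # JSON text -> dict

images = [p['images']['image700']['url'] for p in parsed['data']['posts']]
print(images)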

output:

['https://img-9gag-fun.9cache.com/photo/abY9Wg8_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aLgy4o5_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aE2LVeM_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/amBEGb4_700b.jpg', 'https://img-9gag-fun.9cache.com/photo/aKxrv56_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/a5M8wXN_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aNY6QEv_700b.jpg', 'https://img-9gag-fun.9cache.com/photo/aYY2Deq_700b.jpg', 'https://img-9gag-fun.9cache.com/photo/aQR0AEw_460s.jpg', 'https://img-9gag-fun.9cache.com/photo/aLgy19P_700b.jpg']
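
The extraction can also be wrapped in a function and tried without hitting the site. This is a sketch under stated assumptions: `extract_image_urls` is a hypothetical helper name, and `sample_html` is a made-up page that only mimics the double-encoded `JSON.parse(...)` layout described here. The real page may differ or change over time.

```python
import json
import re


def extract_image_urls(html):
    """Return the image700 URL of every post found in the embedded JSON."""
    match = re.search(r"JSON\.parse\((.*)\);</script>", html)
    if match is None:
        return []  # script block missing, e.g. the page layout changed
    encoded = json.loads(match.group(1))  # JS string literal -> JSON text
    parsed = json.loads(encoded)          # JSON text -> dict
    return [post["images"]["image700"]["url"] for post in parsed["data"]["posts"]]


# Made-up data, encoded twice to mimic the page's JSON.parse("...") argument.
payload = {
    "data": {
        "posts": [
            {"images": {"image700": {"url": "https://example.com/a_700b.jpg"}}},
        ]
    }
}
sample_html = (
    "<script>window._config = JSON.parse("
    + json.dumps(json.dumps(payload))
    + ");</script>"
)

print(extract_image_urls(sample_html))
print(extract_image_urls("<html></html>"))
```

Returning an empty list when the pattern is not found avoids the `AttributeError` that calling `.group(1)` on `None` would raise.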
abdusco
  • Getting the following error when entering your code @abdusco: TypeError: cannot use a string pattern on a bytes-like object – snaz Jul 12 '19 at 16:49
  • Try changing `res.content` to `res.text` to get decoded HTML instead. – abdusco Jul 12 '19 at 18:42
  • Got it to work! Many thanks! Do you know if it is possible to extract the title, upvotes and size of the images? @abdusco – snaz Jul 13 '19 at 12:47
  • You might have to visit individual pages for that – abdusco Jul 13 '19 at 12:55
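
Before visiting individual pages, it may be worth checking which fields each post in the embedded JSON already carries. The sketch below lists the key paths of a nested dict. `list_key_paths` is a hypothetical helper, and `sample_post` is made-up data whose field names are assumptions for illustration, not 9gag's confirmed schema.

```python
def list_key_paths(value, prefix=""):
    """Return dotted key paths for every non-dict leaf in a nested dict."""
    paths = []
    for key, item in value.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            paths.extend(list_key_paths(item, path))
        else:
            paths.append(path)
    return paths


# Hypothetical post; a real one would come from parsed['data']['posts'].
sample_post = {
    "title": "demo",
    "images": {
        "image700": {"url": "https://example.com/a.jpg", "width": 700, "height": 500},
    },
}

for path in list_key_paths(sample_post):
    print(path)
```

Running the helper on one real post shows at a glance whether a title, vote count, or image dimensions are present, or whether the individual pages are needed after all.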