
I'm learning web scraping with Python, and as an all-in-one exercise I'm trying to build a game catalog using the Beautiful Soup and requests modules as my main tools. The problem, though, lies in the statements that deal with the requests module.

DESCRIPTION:

The exercise is about getting all the genre tags used to classify games whose titles start with the letter A, beginning with the first page. Each page shows around (or exactly) 30 games, so to access a specific page of a given letter, you have to request a URL of this form:

  1. https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/1
  2. https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/2
  3. https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/3

And so on. In fact, each letter's main page has the form https://vandal.elespanol.com/juegos/13/pc/letra/, which is equivalent to https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/.
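
For reference, a trivial sketch (just assuming the URL scheme shown above) of how those page URLs can be built:

def page_url(letter: str, page: int) -> str:
    # Build the catalog URL for a given letter and page number.
    return f'https://vandal.elespanol.com/juegos/13/pc/letra/{letter}/inicio/{page}'

print(page_url('a', 2))  # https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/2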

Scraping genres from a few pages is no big deal, but what if I want to scrape all of them for a letter? How do I know when I'm done scraping genres from all the games of a letter?

When you request the URL https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200, for example, you get redirected to the corresponding letter's main page, i.e. the first 30 games, since in the end there are no more games to return. Bearing that in mind, I was thinking about checking the status_code of the requests.get() response, but I get 200 as the status code, whereas when analyzing the packets received with Chrome DevTools I see a 301. At the end of the program I save the scraped genres to a file.
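
Here's a minimal sketch of what I mean (allow_redirects and response.history are standard parts of the requests API):

import requests

url = 'https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200'

# requests.get() follows redirects by default, so status_code is the
# 200 of the page we end up on, not the 301 that Chrome shows.
rq = requests.get(url)
print(rq.status_code)  # 200
print(rq.history)      # the redirect chain, e.g. [<Response [301]>]

# Disabling redirects exposes the original 301 directly.
rq = requests.get(url, allow_redirects=False)
print(rq.status_code)              # 301
print(rq.headers.get('Location'))  # where the server redirects to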


And here's the code:

from bs4 import BeautifulSoup
import requests
from string import ascii_lowercase  # intended for iterating over every letter later



def write_genres_to_file(site_genres):
    # Write the collected genres to a file, three per line; the middle genre
    # of each line is centered in a field 8 characters wider than itself.
    with open('/home/l0new0lf/Desktop/generos.txt', 'w') as file_:
        print(f'File "{file_.name}" OPENED to write {len(site_genres)} GENRES')
        counter = 1  # cycles 1, 2, 3: the genre's position within the current line
        for num in range(len(site_genres)):
            if counter == 2:
                # Middle genre: pad it so the columns stay separated.
                file_.write(f'{site_genres[num]:^{len(site_genres[num]) + 8}}')
            elif counter == 3:
                # Third genre: finish the line and restart the cycle.
                file_.write(f'{site_genres[num]}\n')
                counter = 0
            else:
                file_.write(f'{site_genres[num]}')
            print(f'Wrote genre "{site_genres[num]}" SUCCESSFULLY!')
            counter += 1

def get_tags():

    #TITLE_TAG_SELECTOR = 'tr:first-child td.ta14b.t11 div a strong'
    #IMG_TAG_SELECTOR = 'tr:last-child td:first-child a img'
    #DESCRIPTION_TAG_SELECTOR = 'tr:last-child td:last-child p'
    GENRES_TAG_SELECTOR = 'tr:last-child td:last-child div.mt05 p'
    GAME_SEARCH_RESULTS_TABLE_SELECTOR = 'table.mt1.tablestriped4.froboto_real.blanca'

    GAME_TABLES_CLASS = 'table transparente tablasinbordes'

    site_genres = []

    for i in ['a']:  # just the letter 'a' for now; eventually every letter
        counter = 1  # page number within the current letter

        while True:
            url = f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}'
            rq = requests.get(url)

            if rq:  # a Response is truthy when its status code is below 400
                print(f'Request GET: from {url} Got Workable Code!')

            # Intended stop condition: a redirect means the letter has no more
            # pages. NOTE: this never fires, because requests.get() follows
            # redirects by default, so status_code is the 200 of the final page.
            if rq.status_code in (301, 302, 303, 304):
                print(f'No more games in letter {i}\n**REDIRECTING TO **')
                break

            counter += 1

            soup = BeautifulSoup(rq.content, 'lxml')
            main_table = soup.select_one(GAME_SEARCH_RESULTS_TABLE_SELECTOR)
            game_tables = main_table.find_all('table', {'class': GAME_TABLES_CLASS})

            for game in game_tables:
                # Genres come as a ' / '-separated string, e.g. 'Action / RPG'.
                genres_str = str(game.select_one(GENRES_TAG_SELECTOR).contents[1]).strip().split(' / ')
                for genre in genres_str:
                    if genre not in site_genres:
                        site_genres.append(genre)

    write_genres_to_file(site_genres)
    
            
get_tags()

So, roughly, my question is: how can I know when I'm done scraping all the games starting with a certain letter, so that I can start scraping the games of the next one?

NOTE: I could only think of comparing, on every iteration of the loop, whether the returned HTML structure is the same as that of the letter's first page, or maybe checking whether I'm receiving repeated games. But I don't think that should be the way to go about it.

Any help is truly welcome, and I'm very sorry for the very long problem description, but I thought it was necessary.

L0neW0lf

1 Answer


I simply would not rely on a status code alone. You might get a non-200 status even for pages that are there, for example if you exceed a certain request rate described in their robots.txt, or if your network has a delay or error.

So, to reply to your question, "How do I ensure that I scraped all pages corresponding to a certain letter?": you can extract the "visible text" of each page, as in this reply, BeautifulSoup Grab Visible Webpage Text, and hash its content. When you hit a hash you have seen before, you know that you already crawled/scraped that page, and you can then move on to the next letter.

As an example of a hashing snippet, I would use the following:

import urllib.request
from bs4 import BeautifulSoup


def from_text_to_hash(url: str) -> str:
    """Download the page, extract its visible text, and hash it."""
    url_downloaded = urllib.request.urlopen(url)
    soup = BeautifulSoup(url_downloaded, "lxml")
    # Title plus body text is a reasonable proxy for the visible text.
    visible_text = soup.title.text + "\t" + soup.body.text
    # Note: Python's built-in hash() is randomized per process, so only
    # compare these hashes within a single run of the program.
    current_hash = str(hash(visible_text))
    return current_hash

And you keep track of the current_hash values in a set.
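
For instance, a minimal sketch of the pagination loop built on that idea (the URL scheme is taken from the question, and from_text_to_hash is the function above):

seen_hashes = set()
page = 1

while True:
    url = f'https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/{page}'
    current_hash = from_text_to_hash(url)
    if current_hash in seen_hashes:
        # Same visible text as a page we already scraped: the site has
        # redirected us back, so this letter is exhausted.
        break
    seen_hashes.add(current_hash)
    # ... scrape the genres from this page here ...
    page += 1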

purple_lolakos
  • Many thanks for your answer, it really solves the problem at hand, but each letter has around 200 pages, each one with a lot of text to save. Saving all of this into memory would, AFAIK, be very inefficient. – L0neW0lf Feb 11 '21 at 09:46
  • You don't have to save the text: you extract the visible text, hash it, and store the hash in a "set" data structure. As you say, if each letter has around 200 pages, then you will save 200 hashes. A hash is way shorter. Hope this explains :) – purple_lolakos Feb 11 '21 at 10:06