I'm learning web scraping with Python, and as an all-in-one exercise I'm trying to build a game catalog using the Beautiful Soup and requests modules as my main tools. The problem, though, lies in handling the statements related to the requests module.
DESCRIPTION:
The exercise is about getting all the genre tags used to classify the games starting with the letter A on the first page. Each page shows around 30 games, so to access a specific page of a given letter independently, you have to use a URL of this form:
- https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/1
- https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/2
- https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/3
And so on... In fact, each letter's main page has the form https://vandal.elespanol.com/juegos/13/pc/letra/ followed by the letter, which is equivalent to https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/ for the letter a.
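For reference, with that scheme the paginated URLs can be generated directly (just a quick sketch; the range of 3 pages is only for illustration):

from string import ascii_lowercase

# Print the first three result pages for every letter, following the
# URL pattern described above.
for letter in ascii_lowercase:
    for page in range(1, 4):
        print(f'https://vandal.elespanol.com/juegos/13/pc/letra/{letter}/inicio/{page}')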
Scraping genres from a few pages is no big deal, but if I want to scrape all of them for a letter, how do I know when I'm done scraping the genres from all the games of that letter?
When you request the URL https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200, for example, you get redirected to the corresponding letter's main page (i.e. the first 30 games), since there are no more games to return. Bearing that in mind, I was thinking about checking the status_code of the response from requests.get(), but I get 200 as the status code, whereas when analyzing the packets received with Chrome DevTools I get 301. At the end of the program I save the scraped genres to a file.
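If it's relevant: requests follows redirects automatically by default, so by the time you read status_code, the 301 that Chrome DevTools shows has already been followed and you only ever see the final 200. A minimal sketch of two ways to surface the redirect, reusing the example URL from above:

import requests

url = 'https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200'

# Option 1: ask requests not to follow the redirect, so the 3xx
# response itself comes back.
rq = requests.get(url, allow_redirects=False)
print(rq.status_code)              # e.g. 301 when the page is out of range
print(rq.headers.get('Location'))  # where the server redirects to

# Option 2: follow redirects (the default) and inspect the trail.
rq = requests.get(url)
print(rq.status_code)  # 200, the final response after redirecting
print(rq.history)      # non-empty list of the intermediate 3xx responses
print(rq.url)          # final URL, different from the one requested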
And here's the code:
from bs4 import BeautifulSoup
import string
import requests
from string import ascii_lowercase

def write_genres_to_file(site_genres):
    with open('/home/l0new0lf/Desktop/generos.txt', 'w') as file_:
        print(f'File "{file_.name}" OPENED to write {len(site_genres)} GENRES')
        counter = 1
        site_genres_length = len(site_genres)
        # Lay the genres out three per row, padding the middle column.
        for num in range(site_genres_length):
            print('inside File Loop')
            if counter != 2:
                if counter == 3:
                    file_.write(f'{site_genres[num]}' + '\n')
                    print('wrote something')
                    counter = 0
                else:
                    file_.write(f'{site_genres[num]}')
            else:
                file_.write(f'{site_genres[num]:^{len(site_genres[num]) + 8}}')
            print(f'Wrote genre "{site_genres[num]}" SUCCESSFULLY!')
            counter += 1

def get_tags():
    #TITLE_TAG_SELECTOR = 'tr:first-child td.ta14b.t11 div a strong'
    #IMG_TAG_SELECTOR = 'tr:last-child td:first-child a img'
    #DESCRIPTION_TAG_SELECTOR = 'tr:last-child td:last-child p'
    GENRES_TAG_SELECTOR = 'tr:last-child td:last-child div.mt05 p'
    GAME_SEARCH_RESULTS_TABLE_SELECTOR = 'table.mt1.tablestriped4.froboto_real.blanca'
    GAME_TABLES_CLASS = 'table transparente tablasinbordes'

    site_genres = []
    for i in ['a']:
        counter = 1
        while True:
            rq = requests.get(f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}')
            if rq:
                print('Request GET: from ' + f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}' + ' Got Workable Code !')
            # Intended stop condition: bail out when the site redirects.
            if rq.status_code in (301, 302, 303, 304):
                print(f'No more games in letter {i}\n**REDIRECTING TO **')
                break
            counter += 1
            soup = BeautifulSoup(rq.content, 'lxml')
            main_table = soup.select_one(GAME_SEARCH_RESULTS_TABLE_SELECTOR)
            #print('This is the MAIN TABLE:\n' + str(main_table))
            game_tables = main_table.find_all('table', {'class': GAME_TABLES_CLASS})
            #print('These are the GAME TABLES:\n' + str(game_tables))
            for game in game_tables:
                genres_str = str(game.select_one(GENRES_TAG_SELECTOR).contents[1]).strip().split(' / ')
                for genre in genres_str:
                    if genre not in site_genres:
                        site_genres.append(genre)
    write_genres_to_file(site_genres)

get_tags()
So, roughly, my question is: how could I know when I'm done scraping all the games starting with a certain letter, so that I can start scraping the games of the next one?
NOTE: All I could think of was comparing, on each loop iteration, whether the returned HTML structure is the same as the letter's first page, or maybe checking whether I'm receiving repeated games. But I don't think that's the way I should go about it (a rough sketch of a URL-based variant of the idea is below).
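For completeness, here is a variant of that idea that compares URLs instead of page contents (just a sketch; is_real_page is a hypothetical helper, and any URL normalization by the site, such as trailing slashes, could require loosening the equality check). Since out-of-range pages redirect back to the letter's main page, the loop can stop as soon as the final URL no longer matches the requested one:

import requests

def is_real_page(letter, page):
    # Hypothetical helper: True while the paginated URL resolves to
    # itself, False once the site redirects somewhere else.
    url = f'https://vandal.elespanol.com/juegos/13/pc/letra/{letter}/inicio/{page}'
    rq = requests.get(url)
    # rq.url holds the final URL after any redirects were followed.
    return rq.url == url

page = 1
while is_real_page('a', page):
    # ... scrape this page here ...
    page += 1
print(f'Letter "a" has {page - 1} result pages')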
Any help is truly welcome, and I'm very sorry for the very looong problem description, but I thought it was necessary.