
I would like to read approximately 50,000 URLs of IMDB movie pages. From those 50,000 URLs I am interested in the genre tags of each movie. I have already developed code to do that, but its major disadvantage is that I have to download/read the whole HTML document of each URL (caching it in RAM) and only then extract the information I want. To achieve this I need to split the 50,000 links into batches of 5,000 links so as not to use all of my RAM. Even when this works, though, it is not memory efficient.

Below is the code I currently use (with a sample URL):

import requests
import re
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
genres_links = ['http://www.imdb.com/title/tt0114709/'] #Toy Story movie of 1995
genres_url_list = []
myfield_genres = []
myfield_genres_final = []
genres = []
genres_final = []
headers = {"Range": "bytes=0-10"}

# parse each page; passing 'html.parser' explicitly avoids bs4's "no parser specified" warning
genres_url_list = [BeautifulSoup(requests.get(i, headers=headers).text, 'html.parser') for i in tqdm(genres_links)]
myfield_genres = [i.find_all('div', {'class': 'see-more inline canwrap'}) for i in tqdm(genres_url_list)]
# the class matches both the Plot Keywords and the Genres block; keep the Genres one
myfield_genres_final = [[i[1]] if len(i) == 2 else [i[0]] for i in tqdm(myfield_genres)]

r_genres = re.compile("(?=genres)(.*)")
genres=[j.find_all('a', {'href':r_genres}) for i in tqdm(myfield_genres_final) for j in i]
genres_final=[list(map(lambda x: x.text.strip(' ').replace('\n', ''), i)) for i in tqdm(genres)]

My question is: how can I make this memory efficient by reading the `'class':'see-more inline canwrap'` element directly, without reading the whole HTML document?

It's worth noting that the class mentioned matches two items, the Plot Keywords and the Genres. Since both items have the same class name (an IMDB quirk, I guess), I have to filter the result with the `myfield_genres_final` list comprehension above.


A similar question asked here did not help me to figure this out.
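One possible direction is to stream the response and stop reading once the target class has been seen, then parse only the downloaded prefix. This is a sketch, not a tested solution: the class name and the Plot-Keywords/Genres filtering come from the code above, while `fetch_prefix`, `extract_genres`, and the byte thresholds are illustrative names and assumptions, and IMDb's markup may have changed since. It only saves memory and bandwidth if the genres block appears before the end of the page.

```python
import requests
from bs4 import BeautifulSoup

TARGET = b'class="see-more inline canwrap"'

def fetch_prefix(url, stop_marker=TARGET, chunk_size=8192, max_bytes=512_000):
    """Stream the page and stop downloading shortly after the marker appears."""
    buf = bytearray()
    with requests.get(url, stream=True, timeout=30) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=chunk_size):
            buf.extend(chunk)
            pos = buf.find(stop_marker)
            # read a little past the marker so the closing </div> is included
            if pos != -1 and len(buf) - pos > 16_384:
                break
            if len(buf) >= max_bytes:  # safety cap if the marker never appears
                break
    return bytes(buf)

def extract_genres(html):
    """Parse only the downloaded prefix and pull the genre links out of it."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", {"class": "see-more inline canwrap"})
    if not blocks:
        return []
    block = blocks[1] if len(blocks) == 2 else blocks[0]  # 2nd block = Genres
    return [a.get_text(strip=True)
            for a in block.find_all("a", href=lambda h: h and "genres" in h)]
```

Usage would be `extract_genres(fetch_prefix(url))` per link, so at most `max_bytes` of any page is ever held in memory at once.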

NikSp
  • Have you considered not doing it with web scraping? Use something that's better structured for this task, e.g. the https://www.themoviedb.org/ API. – jonrsharpe Feb 14 '21 at 14:29
  • @jonrsharpe my links are strictly from IMDB because of the Grouplens dataset. It provides both IMDB and TMDB ids, but I preferred the IMDB links, which is why I stick with them. Since I have already downloaded data and content from IMDB I can't move to another API. Still, your recommendation is a valid option because I have both IMDB and TMDB ids. – NikSp Feb 14 '21 at 14:50
  • Servers don't have a function to send you only a selected `div`. For some servers you may use the [Range](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) header to read only part of the HTML - but since every page may have `'class':'see-more inline canwrap'` in a different place, you can't reliably read only that part of the HTML from the server. Web pages weren't designed to send only some parts, only full pages. Servers use APIs to send only the data, without HTML, to make this simpler. – furas Feb 14 '21 at 18:08
  • I don't know why you use `tqdm` - every `print()` or other method of displaying messages only slows down the code. Sometimes removing `print()` is the best way to make it faster. – furas Feb 14 '21 at 18:15
  • For me, "memory efficient" means reading only a few HTMLs at a time, getting the genres, `del`-ing those HTMLs from memory after use, and saving every result to a file/database instead of keeping it in memory. With the saving it may work a bit slower, though. – furas Feb 14 '21 at 18:19
  • @furas thank you for the replies. Indeed, deleting the HTML documents is an option. I will keep your notes about `tqdm` and `print()` in mind - I didn't know `tqdm` could slow down operations. I use it to see which URL an error occurs at, so it's important to me for debugging in case the code stops. – NikSp Feb 14 '21 at 20:11
  • When you run a loop and `print()` text for every item, it may slow down the code. But if you `print()` only every, e.g., 100 items, it slows things down much less - that is a way to make progress printing faster. I assume `tqdm` may work this way - displaying new text only every few items, so it doesn't slow things down as much. – furas Feb 14 '21 at 20:56
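The batching idea from the comments could be sketched like this. The function names are hypothetical; `fetch_genres` stands in for whatever extraction routine is used per URL. Results are appended to a CSV immediately and the parsed data is discarded, so memory use stays flat regardless of how many of the 50,000 links are processed, and progress is printed only every N items rather than per item:

```python
import csv

def scrape_in_batches(urls, fetch_genres, out_path, report_every=100):
    """Fetch each URL, write the result to a CSV right away, keep nothing in
    memory. fetch_genres(url) -> list of genre strings (caller-supplied)."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "genres"])
        for n, url in enumerate(urls, 1):
            genres = fetch_genres(url)
            writer.writerow([url, "|".join(genres)])
            del genres                          # drop the parsed data right away
            if n % report_every == 0:           # progress message every N items,
                print(f"{n}/{len(urls)} done")  # not every item (cheaper)
```

If the process crashes partway through, the CSV written so far also tells you which URL to resume from, which covers the debugging use of `tqdm` mentioned above.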

0 Answers