I would like to read approximately 50,000 IMDB movie URLs and extract the Genre tags of each movie. I have already developed code that does this, but its major disadvantage is that it downloads the whole HTML document of every URL (caching it in RAM) before extracting the information I want. To avoid exhausting my RAM I have to split the 50,000 links into batches of 5,000, and even when this works, it is not memory efficient.
Below is the code I currently use (with a sample URL):
import requests
import re
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
genres_links = ['http://www.imdb.com/title/tt0114709/'] #Toy Story movie of 1995
genres_url_list = []
myfield_genres = []
myfield_genres_final = []
genres = []
genres_final = []
headers = {"Range": "bytes=0-10"}  # attempt at a partial download; the full page comes back anyway
genres_url_list = [BeautifulSoup(requests.get(i, headers=headers).text, 'html.parser') for i in tqdm(genres_links)]
myfield_genres=[i.find_all('div', {'class':'see-more inline canwrap'}) for i in tqdm(genres_url_list)]
myfield_genres_final = [[i[1]] if len(i)==2 else [i[0]] for i in tqdm(myfield_genres)]
r_genres = re.compile("(?=genres)(.*)")
genres=[j.find_all('a', {'href':r_genres}) for i in tqdm(myfield_genres_final) for j in i]
genres_final=[list(map(lambda x: x.text.strip(' ').replace('\n', ''), i)) for i in tqdm(genres)]
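For context, I can restructure the same pipeline so that only one parsed document is alive at a time (a sketch; the helper names `genres_from_html` and `genres_for_links` are mine, and this still downloads full pages, it just avoids holding 5,000 soups at once):

```python
import requests
from bs4 import BeautifulSoup

def genres_from_html(html):
    # Parse one document, pull out the genre names, and let the soup
    # be garbage-collected as soon as the function returns.
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.find_all('div', {'class': 'see-more inline canwrap'})
    if not blocks:
        return []
    block = blocks[1] if len(blocks) == 2 else blocks[0]
    return [a.text.strip() for a in block.find_all('a', href=True)
            if 'genres' in a['href']]

def genres_for_links(links):
    # Lazy per-URL processing: only one response/soup is in memory at
    # any moment, so no manual batching into groups of 5,000 is needed.
    for url in links:
        yield genres_from_html(requests.get(url).text)
```

With this shape the batching disappears, but every page is still fetched in full, which is the part I want to get rid of.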
My question is: how can I make this memory efficient by reading the 'class':'see-more inline canwrap' element directly, without downloading the whole HTML document?
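One direction I have been thinking through (a sketch only — the marker bytes, chunk size, the `read_until_marker`/`fetch_genre_region` names, and the assumption that the genre block appears early in the document are all my guesses, not anything IMDB guarantees) is to stream the response and stop reading once the class name has gone past:

```python
import requests

MARKER = b'see-more inline canwrap'

def read_until_marker(chunks, marker=MARKER, extra_chunks=2, max_bytes=3_000_000):
    # Accumulate chunks only until the marker (plus a little slack so the
    # closing tags arrive too) has been seen; max_bytes is a safety cap
    # for pages where the marker never appears.
    buf = bytearray()
    chunks_after_hit = 0
    for chunk in chunks:
        buf += chunk
        if marker in buf:
            chunks_after_hit += 1
            if chunks_after_hit > extra_chunks:
                break
        elif len(buf) >= max_bytes:
            break
    return bytes(buf)

def fetch_genre_region(url):
    # Stream the page and close the connection early; the prefix returned
    # here can then be fed to BeautifulSoup instead of the full document.
    with requests.get(url, stream=True) as r:
        return read_until_marker(r.iter_content(chunk_size=16_384))
```

Closing the connection early stops the transfer, so at most a prefix of each page is ever held in memory, and requests still decompresses gzip-encoded chunks transparently. But I am not sure this is the right approach, or whether something cleaner exists.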
It's noteworthy that the class mentioned matches two blocks, the Plot Keywords and the Genres. Since both blocks share the same class name (an IMDB quirk, I guess), I have to filter the list with the `myfield_genres_final` line above.
A similar question asked here did not help me to figure this out.