I'm trying to scrape a website with bs4 (BeautifulSoup), but the code is very slow because there are many tr elements to extract one by one.
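For context, the soup is built roughly like this (the URL and parser here are placeholders; the download itself is fast, it's the row extraction below that takes all the time):

import requests
from bs4 import BeautifulSoup

# placeholder URL; the real page has roughly 3000 <tr id="mix..."> rows
html = requests.get("https://example.com/matches").text
soup = BeautifulSoup(html, "html.parser")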
- I tried a plain for loop, but it takes more than 3 minutes to scrape 3000 tr elements (1000 rows):
ROWS = []
# soup.select('tr[id^=mix]') is a list of HTML elements
for tr in soup.select('tr[id^=mix]'):
    dt = tr.select_one('.h').text
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
    # ....
    row = [dt, H_team, A_team, ...]
    ROWS.append(row)
    print(row)
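To see where the time goes, I timed the lookups on a single sample row, roughly like this (a quick sketch; the numbers vary by machine, but the select_one calls seem to dominate):

import timeit

tr = soup.select_one('tr[id^=mix]')  # one sample row
# rough cost of one CSS lookup, repeated 1000 times (sketch)
print(timeit.timeit(lambda: tr.select_one('td.Home').text.strip(), number=1000))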
- I tried a list comprehension, but it didn't change the speed (it was even slightly slower):
def my_funct(tr):
    dt = tr.select_one('.h').text
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
    row = [dt, H_team, A_team]
    return row

ROWS = [my_funct(tr) for tr in soup.select('tr[id^=mix]')]
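I also considered replacing the CSS selectors with find() calls, something like the sketch below, but I assume that doesn't change the fundamental one-row-at-a-time work:

def my_funct_find(tr):
    # same extraction with find() instead of CSS selectors (untested sketch)
    dt = tr.find(class_='h').text
    H_team = tr.find('td', class_='Home').text.strip()
    A_team = tr.find('td', class_='Away').text.strip()
    return [dt, H_team, A_team]

ROWS = [my_funct_find(tr) for tr in soup.select('tr[id^=mix]')]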
- I tried the multiprocessing module (through its thread-based multiprocessing.dummy interface), but the speed is the same:
from multiprocessing.dummy import Pool as ThreadPool

def my_funct(tr):
    dt = tr.select_one('.h').text
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
    row = [dt, H_team, A_team]
    return row

pool = ThreadPool(4)
ROWS = pool.map(my_funct, soup.select('tr[id^=mix]'))
pool.close()
pool.join()
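My understanding is that the parsing is CPU-bound, so threads are limited by the GIL; real processes would need each row as a string, since soup objects can't be pickled and sent to workers. I sketched that idea like this (untested; I suspect the str() and re-parse overhead eats most of the gains):

from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_fragment(tr_html):
    # each worker re-parses one <tr>...</tr> string on its own
    tr = BeautifulSoup(tr_html, "html.parser")
    dt = tr.select_one('.h').text
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
    return [dt, H_team, A_team]

if __name__ == "__main__":
    # soup built as above; ship plain strings, not Tag objects
    fragments = [str(tr) for tr in soup.select('tr[id^=mix]')]
    with Pool(4) as pool:
        ROWS = pool.map(parse_fragment, fragments)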
- I tried asyncio, but it didn't work (it raises an error):
import asyncio

async def my_funct(tr):
    dt = tr.select_one('.h').text
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
    row = [dt, H_team, A_team]
    return row

async def s():
    await asyncio.gather(*[my_funct(tr) for tr in soup.select('tr[id^=Today]')])

asyncio.run(s())
# returns an error: "RuntimeError: asyncio.run() cannot be called from a running event loop"
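From what I understand, that RuntimeError means an event loop is already running (in my case, likely because I'm in a Jupyter notebook). Even if I work around it by awaiting directly in a cell, I assume it wouldn't be faster, because nothing inside my_funct actually awaits, so the rows would still be processed sequentially:

# in a notebook cell the loop is already running, so await directly (sketch):
ROWS = await asyncio.gather(*[my_funct(tr) for tr in soup.select('tr[id^=mix]')])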
How can I run the row extraction in parallel so the code doesn't spend so long processing each row one by one?