
I am trying to scrape a website using bs4, but the code is very slow because there are many `tr` elements to process one by one.

1. I tried a plain loop, but it takes more than 3 minutes to scrape 3000 `tr` (1000 rows):
ROWS = []
#soup.select('tr[id^=mix]') is a list of html elements
for tr in soup.select('tr[id^=mix]'): 
    dt = tr.select_one('.h').text 
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
    #....

    row = [dt, H_team, A_team, ...]    
    ROWS.append(row)
    print(row)
2. I tried a list comprehension, but it didn't change the speed (it was even slower):
def my_funct(tr):
    dt = tr.select_one('.h').text 
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
        
    row = [dt, H_team, A_team]    
    return row 

ROWS = [my_funct(tr) for tr in soup.select('tr[id^=mix]')]
3. I tried the multiprocessing module, but the speed is the same:
from multiprocessing.dummy import Pool as ThreadPool

def my_funct(tr):
    dt = tr.select_one('.h').text 
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
        
    row = [dt, H_team, A_team]    
    return row 

pool = ThreadPool(4)
ROWS = pool.map(my_funct, soup.select('tr[id^=mix]'))

pool.close()
pool.join()
4. I tried asyncio, but it didn't work (it returns an error):
import asyncio

async def my_funct(tr):
    dt = tr.select_one('.h').text 
    H_team = tr.select_one('td.Home').text.strip()
    A_team = tr.select_one('td.Away').text.strip()
        
    row = [dt, H_team, A_team]    
    return row 

async def s():
    return await asyncio.gather(*[my_funct(tr) for tr in soup.select('tr[id^=Today]')])

ROWS = asyncio.run(s())

#return error: "RuntimeError: asyncio.run() cannot be called from a running event loop"

How can I run the scraping of the rows in parallel so my code doesn't take a long time processing each row one by one?

  • What's the url? It shouldn't really take that long to parse a table. – chitown88 Jun 13 '22 at 14:18
  • the table columns are more than that, I just put 3 columns here – khaled koubaa Jun 13 '22 at 14:24
  • What's the url you're trying to get the data from? – chitown88 Jun 13 '22 at 14:30
  • It's hard to debug and test for a solution if you don't share more info. If you share the url, we can see if there's an api instead of parsing the html. – chitown88 Jun 13 '22 at 16:13
  • it's sports betting website, unfortunately no api – Khaled Koubaa Jun 13 '22 at 16:34
  • ok, so then share the url and tell us what data you are after. Again, not much anyone can do beyond what you've already tried without knowing what we are dealing with. – chitown88 Jun 16 '22 at 11:16
  • Have you tried `ProcessPool` instead of `ThreadPool`? – aaron Jun 16 '22 at 11:46
  • [This question](https://stackoverflow.com/questions/25539330/speeding-up-beautifulsoup) is a good question template (as @chitown88 said, we really need a sample url) _as well as_ a resource for speedup hints. Some notes: 1.) use `lxml` parser, 2.) make use of [`SoupStrainer`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document) (hunch: CSS selectors using the "startswith" syntax may be slow. Just use `SoupStrainer`.), 3.) this code is CPU-bound (not IO) [You probably have other code that makes http requests that will be IO-bound.]. (A sketch of notes 1 and 2 follows after these comments.) – webelo Jun 16 '22 at 13:07
  • Also, what is "very very slow"? Are we talking a minute? 10 minutes? An hour? – chitown88 Jun 17 '22 at 13:37
  • @chitown88 more than 10 minutes – khaled koubaa Jun 18 '22 at 09:42
  • @khaledkoubaa ya that's quite lengthy. Can you not share the url? – chitown88 Jun 18 '22 at 09:47
  • @chitown88 I can't share the url, unfortunately – khaled koubaa Jun 18 '22 at 09:51
  • Then it's really hard for anyone to help you out with this. The best we can do is refer to @webelo's comment, read the link provided there, and try his 3 options. Good luck. – chitown88 Jun 18 '22 at 09:57
  • https://stackoverflow.com/questions/68988489/how-to-run-selenium-chromedriver-in-multiple-threads Check this answer here. – NBG Jun 20 '22 at 05:51
  • @NBG I think the answer you referenced will not help here. The question assumes that the page data is already fetched from the web. Selenium is only useful for fetching data from the web in _certain_ cases (e.g. when the page is loaded dynamically after initial load). Selenium does nothing to parse the fetched webpage. – webelo Jun 21 '22 at 15:49
  • maybe this can help you https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table either the pandas version or the parsing table version – Bruno Carballo Jun 21 '22 at 22:28
  • You never used multi-processing, you used *multithreading*, which for a CPU bound task, won't be faster (likely, slower) – juanpa.arrivillaga Jun 22 '22 at 21:45
  • You are most likely asking for help at the wrong step. You need to step back and look at how the page is rendered, whether there is network traffic populating the table, and whether it is easier to use another library (pandas.read_html) or parser (lxml). Without the URL, all help here will be blind guesses. BTW, is it a sports betting website? – Prayson W. Daniel Jun 23 '22 at 05:00
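
Picking up notes 1 and 2 from webelo's comment, here is a minimal sketch of what the lxml + SoupStrainer route could look like. It assumes the raw page source is already available in a string named html and that the rows of interest keep the tr[id^=mix] pattern from the question:

from bs4 import BeautifulSoup, SoupStrainer

# Only parse <tr> elements whose id starts with "mix";
# everything else in the document is skipped during parsing.
only_rows = SoupStrainer('tr', id=lambda v: v and v.startswith('mix'))

# html is assumed to be the raw page source (e.g. response.text);
# lxml is the fastest parser BeautifulSoup supports.
soup = BeautifulSoup(html, 'lxml', parse_only=only_rows)

ROWS = []
for tr in soup.find_all('tr'):
    dt = tr.select_one('.h').get_text(strip=True)
    H_team = tr.select_one('td.Home').get_text(strip=True)
    A_team = tr.select_one('td.Away').get_text(strip=True)
    ROWS.append([dt, H_team, A_team])

Because SoupStrainer limits parsing to the matching elements, the parse itself gets cheaper on top of any savings in the later selects.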

2 Answers


When it comes to performance, there are generally two major sources of bottlenecks: compute and I/O.

I'll assume that the web pages are fully loaded before the scraping starts, which rules out network I/O as the issue. If that is not true and the pages being scraped are paginated, it would be best to first cache all of those pages in memory and only then process them.
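
A minimal sketch of that caching step, assuming plain GET requests are enough to fetch each page (the page_urls list and the requests usage below are illustrative, not taken from the question):

import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of paginated URLs; not from the question.
page_urls = [f"https://example.com/table?page={i}" for i in range(1, 11)]

def fetch(url):
    # Network I/O releases the GIL, so threads genuinely overlap here.
    return requests.get(url, timeout=30).text

# Download everything up front and keep the raw HTML in memory,
# then parse the cached strings afterwards.
with ThreadPoolExecutor(max_workers=8) as executor:
    cached_pages = list(executor.map(fetch, page_urls))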

It looks like you have tried to run the work in parallel with threads (multiprocessing.dummy is a thread pool, not a process pool). Threads share the memory of the process they belong to, which is good since it avoids inter-process communication overhead. However, due to Python's global interpreter lock this will not improve performance for CPU-bound Python code, because only one thread executes Python bytecode at a time.

The fact that it performs slightly worse on your dataset is expected, as there is now additional overhead from managing and context-switching between the threads.

Try switching:

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(4)

to

from multiprocessing import Pool
pool = Pool(4)  # number of worker processes, e.g. the number of available cores

Benchmarking a smaller dataset with various numbers of processes may help identify the optimal count.
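
A minimal sketch of what that could look like, assuming the rows are handed to the workers as HTML strings (bs4 Tag objects do not pickle cleanly across process boundaries) and that soup is the already-parsed page from the question:

from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_row(tr_html):
    # Each worker re-parses one row; html.parser keeps the bare
    # <tr>/<td> fragment intact without a surrounding <table>.
    tr = BeautifulSoup(tr_html, "html.parser")
    dt = tr.select_one('.h').get_text(strip=True)
    H_team = tr.select_one('td.Home').get_text(strip=True)
    A_team = tr.select_one('td.Away').get_text(strip=True)
    return [dt, H_team, A_team]

if __name__ == "__main__":
    # Send picklable strings, not Tag objects, to the worker processes.
    row_htmls = [str(tr) for tr in soup.select('tr[id^=mix]')]
    with Pool(4) as pool:
        ROWS = pool.map(parse_row, row_htmls)

Whether this pays off depends on how heavy each row is: serializing the rows and re-parsing them in the workers adds its own overhead.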

Poiuy

As you do not share your data, I can only guess: you are trying to parse a big table with many columns (N) and M = 3000 rows.

Your current implementation calls select_one("td...") M x N times. This may be what makes your code slow.

What you can try: Get each row with one select.

import pandas as pd

ls = []
for tr in soup.select('tr[id^=mix]'): 
    row = [td.get_text(strip=True) for td in tr.select('td')]
    ls.append(row)
df = pd.DataFrame(ls, columns=["...in the right order.."])

Here is a benchmark

import pandas as pd
from bs4 import BeautifulSoup

def generate_html_table(n_row, n_col):
    tds = [f'<td class="c{i}">{i}</td>' for i in range(n_col)]
    tr = "<tr>" + "".join(tds) + "</tr>"
    table = "<table>" + tr * n_row + "</table>"
    return table

## Generate a html table of 1000 x 20, with class attribute for each column
M, N = 1000, 20
soup = BeautifulSoup(generate_html_table(M, N))

columns = [f"col_{i}" for i in range(N)]

# OP approach: one .select_one for each td
def parse_table_1(soup):
    rows = []
    for tr in soup.select('tr'): 
        row = [tr.select_one(f'td.c{i}').get_text(strip=True) for i in range(N)]
        rows.append(row)
    return pd.DataFrame(rows, columns=columns)

# Proposed approach: one .select for each row
def parse_table_2(soup):
    rows = []
    for tr in soup.select('tr'): 
        row = [td.get_text(strip=True) for td in tr.select("td")]
        rows.append(row)
    return pd.DataFrame(rows, columns=columns)

Results: a ~10x speed-up for a 1000 x 20 table

%timeit parse_table_1(soup)
3.69 s ± 328 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit parse_table_2(soup)
351 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a 1000 x 100 table, it's a ~35x speed-up

1min 9s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.88 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
phi
  • I think this answer needs some supporting rationale. I’m not seeing any speedups here. Point 1 seems to rearrange steps without addressing the root cause. Point 2 should not be pursued: creating DataFrames at every step is a sure way to slow down code. (I love pandas until I need my data fast and uniform.) – webelo Jun 22 '22 at 20:32
  • I updated my answer with rationale. The Dataframe is created ONCE. Not at every step. – phi Jun 22 '22 at 20:42
  • First things first: I think we can all agree that OP needs to provide a url before anyone can provide an answer? I’m not going to nitpick your answer, but I do think the edited answer could benefit from more inspection of the snippets that OP did provide. I didn’t downvote your original answer. I’m going to exit the convo here. – webelo Jun 22 '22 at 20:58