-1

I am trying to parse 50,000 URLs using BeautifulSoup in Python. The parsing works in a loop:

I measured that parsing one page takes 15-18 seconds. I grab around 20 elements from each page.

Why is BeautifulSoup so slow? How can I speed up BeautifulSoup in Python?

Hamama

2 Answers

6

Make sure you understand your bottlenecks.

The first and main problem is not the HTML parsing itself; it is that the parsing runs in a loop.

That means the code is synchronous and blocking: you do not start processing the next URL until you are done with the current one. This does not scale. At 15-18 seconds per page, 50,000 URLs processed one after another would take roughly 208 to 250 hours.

To solve this, switch to an asynchronous approach, for example the Scrapy web-scraping framework, which is currently the most natural move for scaling web-scraping projects.
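To check where the time actually goes, one option is to time the download and the parse separately. The sketch below is only an illustration: `fetch` and `parse` are hypothetical stand-ins (a sleep and a string count), to be replaced with your real HTTP request and your `BeautifulSoup(content, 'html.parser')` call.

```python
import time

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-ins: replace with your real download and parse steps.
def fetch(url):
    time.sleep(0.05)            # simulates network latency
    return "<html><title>demo</title></html>"

def parse(content):
    return content.count("<")   # simulates the parsing work

def profile(url):
    """Return (fetch_seconds, parse_seconds) for a single URL."""
    content, fetch_s = timed(fetch, url)
    _, parse_s = timed(parse, content)
    return fetch_s, parse_s

fetch_s, parse_s = profile("http://url1.com")
print(f"fetch: {fetch_s:.3f}s, parse: {parse_s:.3f}s")
```

If the fetch time dominates on your real pages, the loop is waiting on the network and a faster parser will not help much.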
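As a rough illustration of what "asynchronous" buys you, here is a minimal sketch using only the standard library's `asyncio`. It is not Scrapy: `fetch` is a hypothetical stand-in that sleeps instead of downloading, and `max_concurrent` is an assumed limit you would tune for the target site.

```python
import asyncio
import time

# Hypothetical stand-in for a non-blocking download; a real project would use
# Scrapy or an async HTTP client here instead of asyncio.sleep.
async def fetch(url):
    await asyncio.sleep(0.1)    # simulates waiting on the network
    return f"<html>{url}</html>"

async def fetch_one(url, limit):
    async with limit:           # cap the number of requests in flight
        return await fetch(url)

async def crawl(urls, max_concurrent=10):
    limit = asyncio.Semaphore(max_concurrent)
    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(u, limit) for u in urls))

urls = [f"http://url{i}.com" for i in range(20)]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Twenty simulated downloads of 0.1 s each would take about 2 s one after another; here they overlap, so the waiting is shared instead of summed.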

alecxe
1

Parallelize your processing.

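For example (Python 3):

import queue
import threading

# beautifulSoupFunction(url) stands for your own function that
# downloads one page, parses it, and returns the extracted data.

# runs in parallel, one call per thread
def taskProcess(q, url):
    q.put(beautifulSoupFunction(url))

urls = ["http://url1.com", "http://url2.com"]

q = queue.Queue()
threads = []

for u in urls:
    t = threading.Thread(target=taskProcess, args=(q, u))
    t.daemon = True
    t.start()
    threads.append(t)

# wait for every worker to finish before reading the results
for t in threads:
    t.join()

# results arrive in completion order, not in URL order
while not q.empty():
    print(q.get())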
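
With 50,000 URLs, starting one thread per URL may run into memory or thread limits, so a bounded pool is a possible variant. The sketch below assumes a hypothetical `scrape` function and an assumed pool size of 8; in real code `scrape` would download the page, build the `BeautifulSoup` object, and return the extracted fields, so the whole fetch-and-parse step runs inside the worker.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-URL work (download + parse + extract).
def scrape(url):
    return {"url": url, "title": "title of " + url}

urls = ["http://url1.com", "http://url2.com", "http://url3.com"]

# A fixed pool reuses a few threads rather than starting one per URL.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape, urls))  # results keep the input order

for r in results:
    print(r["url"], r["title"])
```

`pool.map` returns results in the same order as `urls`, which makes it easier to match each result to its page than reading from a shared queue.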
Ajeet Ganga
  • Do I just need to put this in the thread: `q.put(beautifulSoupFunction(url))`? Or all the code below it as well, such as `title = soup.select('.document-title > .id-app-title')[0].text`? – Hamama Dec 21 '16 at 06:59
  • I mean, how do I put this code into `Q`: ` soup = BeautifulSoup(content, 'html.parser') web_site = "" title = soup.select('.document-title > .id-app-title')[0].text` – Hamama Dec 21 '16 at 07:01
  • Frankly, that is a different question, not that I would mind answering it. – Ajeet Ganga Dec 21 '16 at 07:07
  • In Python 3 I get an error: `NameError: name 'Thread' is not defined` – Hamama Dec 21 '16 at 07:47
  • Can you run the following in the interpreter? >>> import threading >>> def p(s): ... print s ... >>> t = threading.Thread(target=p, args=(["Hello"])) >>> t.run > >>> t.run() Hello >>> – Ajeet Ganga Dec 22 '16 at 06:18