
I like the simplicity of dask and would love to use it for scraping a local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why?

from bs4 import BeautifulSoup
import dask, requests, time
import pandas as pd

base_url = 'https://www.lider.cl/supermercado/category/Despensa/?No={}&isNavRequest=Yes&Nrpp=40&page={}'

def scrape(id):
    page = id + 1
    start = 40 * page
    bs = BeautifulSoup(requests.get(base_url.format(start, page)).text, 'lxml')
    # find_all returns Tag objects; extract .text once here
    # (the original extracted .text twice, which raises AttributeError on str)
    prods = [prod.text for prod in bs.find_all('span', attrs={'class': 'product-description js-ellipsis'})]
    brands = [b.text for b in bs.find_all('span', attrs={'class': 'product-name'})]
    return pd.DataFrame({'product': prods, 'brand': brands})

data = [dask.delayed(scrape)(id) for id in range(10)]
df = dask.delayed(pd.concat)(data)
df = df.compute()
Sergio Lucero

2 Answers


Firstly, a 2x speedup - hurray!

You will want to start by reading http://dask.pydata.org/en/latest/setup/single-machine.html

In short, the following three things may be important here:

  • you only have one network, and all the data has to come through it, so that may be a bottleneck
  • by default, you are using threads to parallelise, but the python GIL limits concurrent execution (see the link above)
  • the concat operation is happening in a single task, so this cannot be parallelised, and with some data types may be a substantial part of the total time. You are also drawing all the final data into your client's process with the .compute().
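To illustrate the second point, here is a minimal sketch of choosing the scheduler explicitly at compute time. The `fetch` function is a hypothetical stand-in for `scrape` (no network call), just enough to show the API; `scheduler=` and `num_workers=` are keyword arguments to `dask.compute` in recent Dask releases, and for I/O-bound work the thread count can usefully exceed `cpu_count()`:

```python
import dask

# Hypothetical stand-in for scrape(): returns a small list instead of a DataFrame
@dask.delayed
def fetch(i):
    return [i, i * 2]

tasks = [fetch(i) for i in range(10)]

# Explicitly pick the threaded scheduler (Dask's default for delayed);
# num_workers may exceed the core count when tasks mostly wait on I/O
results = dask.compute(*tasks, scheduler='threads', num_workers=8)

# Flatten the per-task results in the client process
combined = [row for chunk in results for row in chunk]
```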
mdurant
    Thank you Martin, very insightful comment. In particular "you have one network" has opened my brain. I shall proceed to implement this on Amazon Batch instead. – Sergio Lucero May 18 '18 at 14:40

There are meaningful differences between multiprocessing and multithreading. See my answer here for a brief commentary on the differences. In your case, that results in only a 2x speedup instead of, say, a 10x-50x speedup.

Basically, your problem doesn't scale with more cores the way it would with more threads, since it's I/O-bound, not processor-bound.

Configure Dask to run in multithreaded mode instead of multiprocessing mode. I'm not sure how to do this in Dask, but this documentation may help.
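As a sketch of what "multithreaded mode" might look like, assuming a reasonably recent Dask where `dask.config.set` is the configuration API (older releases used `dask.set_options`), with a trivial placeholder task in place of a real I/O-bound request:

```python
import dask

# Select the threaded scheduler globally for subsequent .compute() calls
dask.config.set(scheduler='threads')

@dask.delayed
def io_task(i):
    return i * i  # placeholder for an I/O-bound request

# Sum the delayed results in a single downstream task, as in the question
total = dask.delayed(sum)([io_task(i) for i in range(5)]).compute()
```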

zelusp