
I like the simplicity of dask and would love to use it for scraping a local supermarket. My multiprocessing.cpu_count() is 4, but this code only achieves a 2x speedup. Why?

from bs4 import BeautifulSoup
import dask, requests, time
import pandas as pd

base_url = 'https://www.lider.cl/supermercado/category/Despensa/?No={}&isNavRequest=Yes&Nrpp=40&page={}'

def scrape(id):
    page = id + 1
    start = 40 * page
    bs = BeautifulSoup(requests.get(base_url.format(start, page)).text, 'lxml')
    # find_all returns Tag objects; extract .text once here
    # (the original extracted .text twice, which raises AttributeError on str)
    prods = [prod.text for prod in bs.find_all('span', attrs={'class': 'product-description js-ellipsis'})]
    brands = [b.text for b in bs.find_all('span', attrs={'class': 'product-name'})]
    return pd.DataFrame({'product': prods, 'brand': brands})

data = [dask.delayed(scrape)(id) for id in range(10)]
df = dask.delayed(pd.concat)(data)
df = df.compute()
Sergio Lucero

2 Answers


Firstly, a 2x speedup - hurray!

You will want to start by reading http://dask.pydata.org/en/latest/setup/single-machine.html

In short, the following three things may be important here:

  • you only have one network, and all the data has to come through it, so that may be a bottleneck
  • by default, you are using threads to parallelise, but the python GIL limits concurrent execution (see the link above)
  • the concat operation is happening in a single task, so this cannot be parallelised, and with some data types may be a substantial part of the total time. You are also drawing all the final data into your client's process with the .compute().
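To illustrate the second point, here is a minimal sketch of choosing the scheduler explicitly at compute time. The `fetch` function is a hypothetical stand-in for `scrape` (no network call), just enough to show the API; `scheduler=` and `num_workers=` are keyword arguments to `dask.compute` in recent Dask releases, and for I/O-bound work the thread count can usefully exceed `cpu_count()`:

```python
import dask

# Hypothetical stand-in for scrape(): returns a small list instead of a DataFrame
@dask.delayed
def fetch(i):
    return [i, i * 2]

tasks = [fetch(i) for i in range(10)]

# Explicitly pick the threaded scheduler (Dask's default for delayed);
# num_workers may exceed the core count when tasks mostly wait on I/O
results = dask.compute(*tasks, scheduler='threads', num_workers=8)

# Flatten the per-task results in the client process
combined = [row for chunk in results for row in chunk]
```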
mdurant
    Thank you Martin, very insightful comment. In particular "you have one network" has opened my brain. I shall proceed to implement this on Amazon Batch instead. – Sergio Lucero May 18 '18 at 14:40

There are meaningful differences between multiprocessing and multithreading. See my answer here for a brief commentary on the differences. In your case, that results in only a 2x speedup instead of, say, a 10x-50x speedup.

Basically, your problem doesn't scale with more cores the way it would with more threads, since it's I/O-bound, not processor-bound.

Configure Dask to run in multithreaded mode instead of multiprocessing mode. I'm not sure how to do this in Dask, but this documentation may help.
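As a sketch of what "multithreaded mode" might look like, assuming a reasonably recent Dask where `dask.config.set` is the configuration API (older releases used `dask.set_options`), with a trivial placeholder task in place of a real I/O-bound request:

```python
import dask

# Select the threaded scheduler globally for subsequent .compute() calls
dask.config.set(scheduler='threads')

@dask.delayed
def io_task(i):
    return i * i  # placeholder for an I/O-bound request

# Sum the delayed results in a single downstream task, as in the question
total = dask.delayed(sum)([io_task(i) for i in range(5)]).compute()
```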

zelusp