
I have the following scraping function already implemented serially, but since there are multiple URLs with data, I would like to parallelize some of the work. Here is the working serial code:

from bs4 import BeautifulSoup as bs
import requests

edbURL = 'URL1'
psnURL = 'URL2'

def urlScraper(URL):
    page = requests.get(URL)
    soup = bs(page.text, 'lxml')
    # build a full link from each result's anchor href
    links = ['base_URL' + str(i.a['href']) for i in soup.find_all('div', class_='info')]
    return links

edbs=urlScraper(edbURL)
psns=urlScraper(psnURL)

What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the threads library but only got some big nasty int returns with the following syntax:

edbs = threads.start_new_thread(urlScraper,(edbURL,))
psns = threads.start_new_thread(urlScraper,(psnURL,))

I figure it has something to do with the return in urlScraper(URL). Then again, I basically know almost nothing about anything. Thanks for any help, everyone!

ThisGuyCantEven
  • If you are hoping that they will actually work in parallel and will therefore be fast, then it doesn't happen in Python. Threads don't actually run in parallel. See this: http://stackoverflow.com/questions/1697571/python-threading-appears-to-run-threads-sequentially – shiva Mar 06 '17 at 18:13
  • I suppose if I actually wanted them in parallel I could use cython and openmp? – ThisGuyCantEven Mar 06 '17 at 18:14
  • You can use multiprocessing to run them in parallel. – shiva Mar 06 '17 at 18:18
  • How does the performance of multiprocessing compare to that of cython/openmp? It seems like it must use subprocess or some other library which to me means extra overhead, then again, it also means less manual memory management. – ThisGuyCantEven Mar 07 '17 at 21:43

1 Answer


multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

https://docs.python.org/2/library/multiprocessing.html
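A minimal sketch of applying this to your case: `multiprocessing.Pool.map` runs the function in separate worker processes and, unlike `thread.start_new_thread`, hands the return values back in input order. The scraper body below is a stand-in (the real one would do the `requests`/BeautifulSoup work from your question, and the URLs here are placeholders):

```python
from multiprocessing import Pool

def urlScraper(URL):
    # stand-in for the real scraper: pretend each URL yields two links
    return [URL + '/link1', URL + '/link2']

if __name__ == '__main__':
    # Pool.map dispatches each URL to a worker process and
    # returns the results in the same order as the inputs
    with Pool(processes=2) as pool:
        edbs, psns = pool.map(urlScraper, ['URL1', 'URL2'])
```

Note the `if __name__ == '__main__':` guard: on platforms that spawn rather than fork, each worker re-imports the main module, and the guard keeps the pool from being created recursively.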

Aaron