
I have the following scraping function already implemented serially, but since there are multiple URLs with data, I would like to parallelize some of the work. Here is the working serial code:

from bs4 import BeautifulSoup as bs
import requests

edbURL = 'URL1'
psnURL = 'URL2'

def urlScraper(URL):
    page = requests.get(URL)
    soup = bs(page.text, 'lxml')
    # build a full link from each result's anchor href
    links = ['base_URL' + str(i.a['href']) for i in soup.find_all('div', class_='info')]
    return links

edbs=urlScraper(edbURL)
psns=urlScraper(psnURL)

What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the threads library but only got some big nasty int returns with the following syntax:

edbs = threads.start_new_thread(urlScraper,(edbURL,))
psns = threads.start_new_thread(urlScraper,(psnURL,))

I figure it has something to do with the return in urlScraper(URL). Then again, I basically know almost nothing about anything. Thanks for any help, everyone!

ThisGuyCantEven
  • If you are hoping that they will actually work in parallel and will therefore be fast, then it doesn't happen in Python. Threads don't actually run in parallel. See this: http://stackoverflow.com/questions/1697571/python-threading-appears-to-run-threads-sequentially – shiva Mar 06 '17 at 18:13
  • I suppose if I actually wanted them in parallel I could use cython and openmp? – ThisGuyCantEven Mar 06 '17 at 18:14
  • You can use multiprocessing to run them in parallel. – shiva Mar 06 '17 at 18:18
  • How does the performance of multiprocessing compare to that of cython/openmp? It seems like it must use subprocess or some other library which to me means extra overhead, then again, it also means less manual memory management. – ThisGuyCantEven Mar 07 '17 at 21:43

1 Answer


multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

https://docs.python.org/2/library/multiprocessing.html
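A minimal sketch of applying this to your case: `multiprocessing.Pool.map` runs the function in separate worker processes and, unlike `thread.start_new_thread`, hands the return values back in input order. The scraper body below is a stand-in (the real one would do the `requests`/BeautifulSoup work from your question, and the URLs here are placeholders):

```python
from multiprocessing import Pool

def urlScraper(URL):
    # stand-in for the real scraper: pretend each URL yields two links
    return [URL + '/link1', URL + '/link2']

if __name__ == '__main__':
    # Pool.map dispatches each URL to a worker process and
    # returns the results in the same order as the inputs
    with Pool(processes=2) as pool:
        edbs, psns = pool.map(urlScraper, ['URL1', 'URL2'])
```

Note the `if __name__ == '__main__':` guard: on platforms that spawn rather than fork, each worker re-imports the main module, and the guard keeps the pool from being created recursively.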

Aaron