I have the following scraping function already implemented serially, but because there are multiple URLs with data, I would like to parallelize some of the work. Here is the working serial code:
from bs4 import BeautifulSoup as bs
import requests
edbURL='URL1'
psnURL='URL2'
def urlScraper(URL):
    page = requests.get(URL)
    soup = bs(page.text, 'lxml')
    l = ['base_URL' + str(i.a['href']) for i in soup.find_all('div', class_='info')]
    return l
edbs=urlScraper(edbURL)
psns=urlScraper(psnURL)
What I would like is for the two calls to urlScraper(URL) to each get their own thread and run in parallel. I tried using the threads library, but with the following syntax I only got some big nasty int returned from each call:
edbs = threads.start_new_thread(urlScraper,(edbURL,))
psns = threads.start_new_thread(urlScraper,(psnURL,))
I figure it has something to do with the return in urlScraper(URL), but then again, I know almost nothing about threading. Thanks for any help, everyone!
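For reference, start_new_thread returns the new thread's identifier (an integer) and discards whatever the target function returns, which would explain the ints above. Below is a minimal sketch of the behavior I am after, assuming the standard-library concurrent.futures module is an acceptable replacement for the low-level thread API. The helper name run_in_threads is hypothetical and not part of any library; it runs each call in its own worker thread and hands back the actual return values.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(func, args_list):
    """Call func(arg) for each arg in its own worker thread.

    Hypothetical helper for illustration only. Results are returned
    in the same order as args_list.
    """
    # One worker per call, so every call gets its own thread.
    with ThreadPoolExecutor(max_workers=max(1, len(args_list))) as pool:
        # submit() returns a Future right away; the call runs in the background.
        futures = [pool.submit(func, arg) for arg in args_list]
        # result() blocks until that call finishes, then returns its value
        # (or re-raises any exception the call raised).
        return [future.result() for future in futures]


# Intended usage with the function from the question (not executed here):
# edbs, psns = run_in_threads(urlScraper, [edbURL, psnURL])
```

Since the scraping is mostly waiting on the network, threads should be enough for the two requests to overlap, even though only one thread runs Python bytecode at a time.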