
As of today, I have a script that reads a table of data on a site and loops through each row, clicking on items to get to additional data, then going back and repeating. Pseudocode would be as follows:

browser = webdriver.Chrome()

node_list = FuncNode(browser)  # This function loops through each row and gets
  # the node identifier as text. This way, I don't lose the reference after
  # clicking and going back due to changes in the DOM

Once I have the list:

for track_id in node_list:

   node = Search_for_node_in_main_page(track_id)  # Now I have the row in a node

   # Get some data

   button = Get_row_button(node)

   button.click()

   # Now I change focus onto the new tab, do some scraping, and write all
   # the data to a MySQL database

   # Close the new tab and focus the browser back on the main tab

   # End of the loop; repeat until the last item on the list is scraped
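The tab juggling in the loop body can be done with Selenium's `window_handles` list and `switch_to.window`. A small helper makes the "focus the tab the click opened" step explicit; the driver calls in the comment are the standard Selenium API, and `browser` and `button` are the objects from the pseudocode above:

```python
def new_tab_handle(handles_before, handles_after):
    # Return the handle of the tab that the click just opened,
    # or None if no new tab appeared.
    new = [h for h in handles_after if h not in handles_before]
    return new[0] if new else None

# Usage with a real driver (sketch; assumes `browser` is a webdriver):
#   before = browser.window_handles
#   button.click()
#   handle = new_tab_handle(before, browser.window_handles)
#   browser.switch_to.window(handle)
#   ...scrape the new tab...
#   browser.close()                          # close the new tab
#   browser.switch_to.window(before[0])      # focus back on the main tab
```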

This usually takes a while, so I was wondering how to optimize it with multiprocessing. From what I have read, the closest approach would be: once I have the list, create a Pool, wrap all the loop code in one function, and map that function over the list:

if __name__ == '__main__':
    with Pool(4) as p:
        records = p.map(cool_function, node_list)
    # The with-block terminates the pool on exit, and map() only returns
    # once all work is done, so no explicit p.terminate()/p.join() is needed

My issue here is that I'm actively using the browser, so I guess each process would have to open a different browser. If so, how could I reuse them? Mainly because the page is heavy on JavaScript and takes a while to load; depending on the page, longer than it takes to scrape 4-5 rows.
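One common way to reuse a browser per process is `Pool`'s `initializer` argument: it runs once in each worker when the pool starts, so each worker can create its browser there, store it in a module-level global, and reuse it for every row it handles. A minimal sketch of the pattern; a dict stands in for the `webdriver.Chrome()` instance so it runs without Selenium, and `scrape_row` is a placeholder for the real per-row work:

```python
from multiprocessing import Pool
import os

# One "browser" per worker process. In the real script this global would
# hold a webdriver.Chrome() instance created in init_worker().
worker_browser = None

def init_worker():
    # Runs once per worker process when the pool starts, so each worker
    # pays the browser start-up cost only once, not once per row.
    global worker_browser
    # Real version: worker_browser = webdriver.Chrome()
    worker_browser = {"pid": os.getpid()}

def scrape_row(track_id):
    # Real version: use worker_browser to find the row, click, switch
    # to the new tab, scrape, and switch back.
    return (track_id, worker_browser["pid"])

def run(node_list, workers=2):
    with Pool(workers, initializer=init_worker) as p:
        return p.map(scrape_row, node_list)

if __name__ == '__main__':
    records = run(["row-%d" % i for i in range(8)])
    print(len(records))  # → 8, produced by at most 2 reused "browsers"
```

Each worker would also need its own copy of the main page open, since a browser can only focus one tab at a time.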

Besides, assuming this works somehow, would writing to MySQL simultaneously from different processes cause any problems?
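MySQL itself copes with concurrent writers (InnoDB uses row-level locking), but a MySQL connection cannot be shared across processes, so each worker would need to open its own. A simpler design is to have the workers return their records from `map` and let the parent do all the writing in one place. A sketch of that single-writer pattern, using stdlib `sqlite3` as a stand-in for MySQL so it runs anywhere (with `mysql.connector` the placeholders would be `%s` instead of `?`, and the table schema here is made up):

```python
import sqlite3
from multiprocessing import Pool

def scrape_record(track_id):
    # Stand-in for the real Selenium work; returns one scraped record.
    return (track_id, "value-for-%s" % track_id)

def save_records(conn, records):
    # A single writer (the parent process) inserts everything in one
    # transaction, so no cross-process write contention is involved.
    conn.execute("CREATE TABLE IF NOT EXISTS stats (track_id TEXT, value TEXT)")
    conn.executemany("INSERT INTO stats VALUES (?, ?)", records)
    conn.commit()

if __name__ == '__main__':
    with Pool(2) as p:
        records = p.map(scrape_record, ["a", "b", "c"])
    conn = sqlite3.connect(":memory:")
    save_records(conn, records)
    print(conn.execute("SELECT COUNT(*) FROM stats").fetchone()[0])  # → 3
```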

So, in short, how could I make multiprocessing work here and optimize my initial script?

puppet
  • It sounds like you're trying to manually do what scrapy will do for you. If you don't need the browser I recommend going the scrapy route instead. – sehafoc Aug 28 '18 at 01:37
  • Perhaps something along the lines of https://stackoverflow.com/q/8550114/3776268 The answer with the most votes is probably more viable. With selenium you will run into problems exactly like you're saying. While you can reuse the browser sessions, you can only use them for one thing at a time. If you figure out which API calls you need and just call those instead, your solution will be much faster – sehafoc Aug 28 '18 at 01:54
  • For clarification, the website is [link](http://www.nowgoal.com), and provides soccer stats. This website doesn't offer a free API and is heavy on JavaScript. I think the only way to speed up the process is if I'm somehow able to click from the main page and access several sheets at a time. As I said, the problem is that each click opens a new tab and I have to focus the browser on it, therefore that object is not available for a parallel click – puppet Aug 28 '18 at 08:57
  • I took a look at the site. The main data comes from an API call to http://www.nowgoal.com/data/bf_en2.js?1535477608000, and then they do polling on a second interval to http://www.nowgoal.com/data/change_en.xml?1535477616000 which looks like a timestamp in the request. The thing is, either you deal with the browser JS rendering overhead, or you use the API. – sehafoc Aug 28 '18 at 17:38
  • Additionally, the data model they use is opaque... to get any meaning out of those APIs you'd have to spend a bit of time looking either at the JS code, or comparing values in the rendered output to the data structure to figure out what is what. Given the problem, I'm not sure if it will be easier to use selenium and deal with the overhead, or trying to figure out the API. I'd probably gravitate toward the API route though. – sehafoc Aug 28 '18 at 18:00
  • I guess you mean trying to decipher how the API works, and instead of using a webdriver, try simple calls simultaneously. The issue here is I looked at the structure a while ago and I would probably have no idea where to start. – puppet Aug 28 '18 at 20:43
  • Hmmm, I see. You could in theory instantiate several web browsers and establish queues to send them data; that would be more efficient than starting a driver for every click. Look at selenium remote-control. Start up the desired number of browsers, then send your commands to them with the simple map you have above. That should at least prevent the load time for each browser? – sehafoc Aug 29 '18 at 16:45

0 Answers