As of today, I have a script that reads a table of data on a site and loops through each row, clicking on items to get to additional data, then going back and repeating. Pseudocode would be as follows:
from selenium import webdriver

browser = webdriver.Chrome()
# FuncNode loops through each row and stores each node's identifier as
# text. This way, I don't lose the reference after clicking and going
# back, since the DOM changes on navigation.
node_list = FuncNode(browser)
Once I have the list:
for track_id in node_list:
    # Re-locate the row so I have a fresh node after each navigation
    node = Search_for_node_in_main_page(track_id)
    # Get some data from the row itself
    button = Get_row_button(node)
    button.click()
    # Switch focus onto the new tab, do some scraping, and write all
    # the data to a MySQL database
    # Close the new tab and focus the browser back on the main tab
# End of the loop: repeat until the last item on the list is scraped
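For context, the tab handling inside the loop is the usual window-handle juggling. A rough sketch of that part (the explicit wait for the second window is just how I'd detect the new tab; the rest matches what I do today):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

main_tab = browser.current_window_handle
button.click()
# Wait until the new tab has actually opened before switching to it
WebDriverWait(browser, 10).until(EC.number_of_windows_to_be(2))
new_tab = [h for h in browser.window_handles if h != main_tab][0]
browser.switch_to.window(new_tab)
# ... scrape the detail page and write the results to MySQL ...
browser.close()                     # close the detail tab
browser.switch_to.window(main_tab)  # focus back on the main tab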
This usually takes a while, so I was wondering how to optimize it with multiprocessing. From what I have read, the closest fit would be: once I have the list, wrap all the loop code in one function, create a Pool, and map that function over the list.
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(4) as p:
        records = p.map(cool_function, node_list)
    # The with block already terminates the pool on exit, so no
    # explicit terminate()/join() calls are needed
My issue here is that I'm actively using the browser, so I guess each process would have to open its own browser. If so, how could I reuse them? Mainly because the page is heavy on JavaScript and takes a while to load; depending on the page, loading takes longer than scraping 4-5 rows does.
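From what I've read, a Pool initializer might be the way to reuse the browsers: each worker process opens one browser when it starts and keeps it for every item it is handed, so the heavy page load is paid once per process instead of once per row. This is only a sketch of what I have in mind; MAIN_PAGE_URL is a placeholder for my site, and the helper names are the ones from my pseudocode:

from multiprocessing import Pool
from selenium import webdriver

browser = None  # one browser per worker process, created by the initializer

def init_worker():
    global browser
    browser = webdriver.Chrome()
    browser.get(MAIN_PAGE_URL)  # placeholder URL: pay the heavy JS load once per worker

def cool_function(track_id):
    # Reuses this worker's browser for every track_id it receives
    node = Search_for_node_in_main_page(track_id)
    button = Get_row_button(node)
    button.click()
    # ... switch tabs, scrape, write to MySQL, switch back ...

if __name__ == '__main__':
    with Pool(4, initializer=init_worker) as p:
        records = p.map(cool_function, node_list)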
Besides, assuming this works somehow, would it cause problems for MySQL to have several processes writing to it simultaneously?
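My understanding is that MySQL itself copes fine with concurrent inserts, as long as each process opens its own connection rather than inheriting a shared one. Extending the same initializer idea, roughly (MySQL Connector/Python here, though PyMySQL would look the same; the credentials and table are placeholders):

import mysql.connector

db = None

def init_worker():
    global db
    # Each worker gets its own connection; a connection object must
    # never be shared across processes
    db = mysql.connector.connect(host="localhost", user="scraper",
                                 password="***", database="tracks")

def write_record(track_id, data):
    # Placeholder table/columns; parameterized to keep the query safe
    cur = db.cursor()
    cur.execute("INSERT INTO records (track_id, data) VALUES (%s, %s)",
                (track_id, data))
    db.commit()
    cur.close()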
So, in short, how can I make multiprocessing work here and optimize my initial script?