
I am scraping a webpage using Selenium in Python. I am able to locate the elements using this code:

from selenium import webdriver
import codecs

driver = webdriver.Chrome()
driver.get("url")
results_table = driver.find_elements_by_xpath('//*[@id="content"]/table[1]/tbody/tr')

Each element in results_table is in turn a set of sub-elements, with the number of sub-elements varying from element to element. My goal is to output each element, as a list or as a delimited string, into an output file. My code so far is this:

results_file = codecs.open(path + "results.txt", "w", "cp1252")

for i, element in enumerate(results_table):
    # leaf sub-elements of this row that directly contain text
    element_fields = element.find_elements_by_xpath(".//*[text()][count(*)=0]")
    element_list = [field.text for field in element_fields]
    stuff_to_write = '#'.join(element_list) + "\r\n"
    results_file.write(stuff_to_write)
    # print(i)
results_file.close()
driver.quit()

This second part of the code takes about 2.5 minutes on a list of ~400 elements, each with about 10 sub-elements. I get the desired output, but it is too slow. What could I do to improve the performance?

Using Python 3.6

horace_vr
  • Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used Splinter or Selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax is for that in Selenium, but there should be a way to grab the whole page. – GaryMBloom Dec 06 '17 at 07:23
  • I am using Selenium because I need to scrape multiple pages on a website where login is needed, and I would like to avoid logging in once for each page. BeautifulSoup is an option, but I do not know how to make it grab the active chromedriver page. And still, learning-wise, I must be doing something structurally wrong in my code – horace_vr Dec 06 '17 at 07:55
  • @horace_vr Does it speed up if you write to the file only once at the end, after the for loop instead of inside each iteration? – Grasshopper Dec 06 '17 at 08:59
  • @Grasshopper No. Already tried that... – horace_vr Dec 06 '17 at 09:01
  • Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. Looks like `driver.page_source` may give the entire contents of the page in Selenium, which I found at https://stackoverflow.com/questions/35486374/how-to-get-the-entire-web-page-source-using-selenium-webdriver-in-python. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster. – GaryMBloom Dec 06 '17 at 13:31
  • @Gary02127 BeautifulSoup is the way to go; I tried it, based on your suggestion, and replaced the webdriver-based processing code, and instead of 2 minutes, the code is executed in a handful of seconds. If you elaborate and post an answer, I will accept it. It certainly answered my OP, although not a solution I had in mind when posting :) – horace_vr Dec 06 '17 at 21:37
  • @horace_vr - Thanks, Horace! I just posted that answer. : ) – GaryMBloom Dec 06 '17 at 21:41

1 Answer


Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used Splinter or Selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax is for that in Selenium, but there should be a way to grab the whole page.

Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. Looks like driver.page_source may give the entire contents of the page in Selenium, which I found at https://stackoverflow.com/questions/35486374/how-to-get-the-entire-web-page-source-using-selenium-webdriver-in-python. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster.
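
A minimal sketch of that approach (reusing driver, codecs, and path from the question; it assumes BeautifulSoup 4 is installed, and the id="content" / td / th selectors are guesses based on the question's XPath, not tested against the real page):

from bs4 import BeautifulSoup

# Pull the rendered page down from the browser once, then parse it offline
soup = BeautifulSoup(driver.page_source, "html.parser")

results_file = codecs.open(path + "results.txt", "w", "cp1252")
# id="content" and the td/th cell tags are assumptions; adjust to your page
table = soup.find(id="content").find("table")
for row in table.find_all("tr"):
    fields = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    results_file.write("#".join(fields) + "\r\n")
results_file.close()

Per the OP's follow-up comment, this cut the runtime from about 2.5 minutes to a handful of seconds, since the page makes one round-trip from the browser instead of one per sub-element.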

GaryMBloom