How to accelerate this process of JavaScript web-page scraping?

Question

This Python function aims to scrape a specific identifier (called a PMID) from a JavaScript-rendered web page. When a URL is passed to the function, it loads the page using Selenium. The code then tries to find the class "pubmedLink" within an <a> tag of the HTML. If found, it returns the extracted PMID to another function.

This works fine, but it is really slow. Is there a way to accelerate the process, maybe by using another parser or a completely different method?

from selenium import webdriver


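def _getPMIDfromURL_(url):

    # A new Chrome session is started for every URL.
    driver = webdriver.Chrome('/usr/protoLivingSystematicReviews/drivers/chromedriver')

    try:
        driver.get(url)
        if driver.find_element_by_css_selector('a.pubmedLink').is_displayed():
            json_text = driver.find_element_by_css_selector('a.pubmedLink').text
            return json_text
    except Exception:
        # Raised, for example, when no matching element is on the page.
        return "no_pmid"
    finally:
        # Runs even after a return, so the browser is always closed.
        driver.quit()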

Examples of the URL for the JS web-page,

@QHarr I think it must be accessible only through the universities in EFTA region. Dang! Others cannot access it. — PinkBanter, Feb 12 '19 at 12:15

Raydel Miranda · Accepted Answer · 2019-02-12T12:14:43.817

1

Well, Selenium is fast; that's why it is the favorite of many testers. On the other hand, you could improve your code by parsing the content once instead of twice.

The return value of the statement

 driver.find_element_by_css_selector('a.pubmedLink')

can be stored in a variable and reused. This will improve your speed by about 1.5x.

try:
    # Look the element up once and reuse the result.
    elem = driver.find_element_by_css_selector('a.pubmedLink')
    if elem.is_displayed():
        return elem.text
except Exception:
    return "no_pmid"
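
The single-lookup idea can also be written as a complete function. The following is only a minimal sketch under stated assumptions, not the code from the question: the name get_pmid and its parameters are hypothetical, and the driver is assumed to be created by the caller and passed in. It uses find_elements, which returns an empty list when nothing matches, so no exception handling is needed for the "not found" case.

```python
def get_pmid(driver, url, selector="a.pubmedLink"):
    """Return the text of the first visible matching element, or "no_pmid".

    `driver` is assumed to be an already-started Selenium WebDriver
    (hypothetical parameter; the question creates the driver inside
    the function instead).
    """
    driver.get(url)

    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    # find_elements returns an empty list when nothing matches.
    for elem in driver.find_elements("css selector", selector):
        # Each element is looked up once and then reused.
        if elem.is_displayed():
            return elem.text

    return "no_pmid"
```

Under those assumptions, a caller would create one driver, call get_pmid(driver, url) for each URL, and call driver.quit() once at the end, which avoids starting a new browser per URL. Whether that helps in practice depends on where the time is actually spent.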

edited Feb 12 '19 at 12:14

answered Feb 12 '19 at 12:08

Raydel Miranda

Where do I parse the URL without the statement driver.get(url) ? – PinkBanter Feb 12 '19 at 13:01
I get what you mean now, but this statement (if elem.is_displayed():) doesn't work. I tried it without the statement and it functions a bit better, but is there a way to just parse this ID and not the entire URL? Any hints welcome. – PinkBanter Feb 12 '19 at 13:13
No, there is no way you can parse just that ID. All the text (at least until that ID is reached) has to be parsed; that's how it can later get elements. – Raydel Miranda Feb 13 '19 at 07:13

score 0 · Answer 2 · answered Feb 12 '19 at 11:57

You can try PhantomJS, it's faster: https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/

Felix Martinez

"Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead" – PinkBanter Feb 12 '19 at 16:01
See: https://stackoverflow.com/questions/46753393/how-to-make-firefox-headless-programmatically-in-selenium-with-python options.headless = True – Felix Martinez Feb 12 '19 at 16:19
Yes, I tried it too. the speed seems to be the same or worse. – PinkBanter Feb 12 '19 at 16:28