Scraping site with python that requires javascript input

Question

I am attempting to scrape a website using the following Python code:

import re
import requests

def get_csrf(page):
    # Non-greedy capture so the match stops at the token's closing quote
    matchme = r'name="csrfToken" value="(.*?)" /'
    csrf = re.search(matchme, page)
    return csrf.group(1)

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'

    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)

        username = 'USER'
        password = 'PASS'

        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK':'authenticationEntryComponent',
                 'submitEvent':'1',
                 'enterClicked':'true',
                 'ajaxSupported':'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()

This code logs in to https://www.edline.net/InterstitialLogin.page successfully, but fails when I then run:

r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)

Instead of printing the expected page, it gives an error. Upon further testing I discovered that the same error appears even if you go directly to the page from a browser. So the only way to reach the page seems to be to run whatever code executes when the button that leads there is clicked. When I looked at the page source, I found that the button linking to the page I'm trying to scrape uses the following code:

<a href="javascript:submitEvent('viewUserDocList', 'TCNK=headerComponent')" tabindex="-1">Private Reports</a>

So essentially I am looking for a way to trigger the above javascript code in python in order to scrape the resulting page.

Use [selenium](http://selenium-python.readthedocs.io/getting-started.html), as it lets you interact with the page from Python the same way a user in a browser would. — tihom, Dec 27 '16 at 01:47
Use `DevTools` in Chrome/Firefox to see which URL and values the browser sends when you click this button. — furas, Dec 27 '16 at 02:04
DevTools has a "Network" tab where you can see all requests sent from the browser to the server. Use the "clear" button to remove the existing requests before you click the link on the page; after the click you will see only the requests it triggered. — furas, Dec 27 '16 at 02:15
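
As a starting point for the DevTools comparison, the link's href already exposes the two arguments the browser hands to submitEvent. The following is a minimal sketch, assuming the href always has the shape shown in the question; `parse_submit_event` is a hypothetical helper name, and the question does not show what submitEvent does with these values, so the actual form fields and URL still have to be read from the Network tab.

```python
import re
from urllib.parse import parse_qsl

# Matches javascript:submitEvent('<event>', '<query-string>') as seen in the link.
SUBMIT_EVENT_RE = re.compile(
    r"javascript:submitEvent\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)

def parse_submit_event(href):
    """Return (event_name, params) for a submitEvent link, or None if it does not match."""
    match = SUBMIT_EVENT_RE.search(href)
    if match is None:
        return None
    event_name, query = match.groups()
    # The second argument looks like a query string, e.g. 'TCNK=headerComponent'.
    return event_name, dict(parse_qsl(query))

href = "javascript:submitEvent('viewUserDocList', 'TCNK=headerComponent')"
print(parse_submit_event(href))
# -> ('viewUserDocList', {'TCNK': 'headerComponent'})
```

If the request captured in DevTools turns out to be a plain form POST carrying these values, it can probably be replayed with `s.post(...)` inside the same `requests.Session`, but that is an assumption to verify against the captured request.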

score 0 · Answer 1 · answered Dec 27 '16 at 10:41

Since the website uses javascript, you need something like selenium that visits the page using a browser. The following code will log in to edline just like your other code did:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any browser really
url = 'https://www.edline.net/InterstitialLogin.page'
driver.get(url)
username_text = driver.find_element(By.XPATH, '//*[@id="screenName"]')  # finds the username text box
username_text.send_keys('username')  # sends 'username' to the username text box
password_text = driver.find_element(By.XPATH, '//*[@id="kclq"]')  # finds the password text box
password_text.send_keys('password')  # sends 'password' to the password text box
# finds the submit button and clicks on it
driver.find_element(By.XPATH, '/html/body/form[3]/div/div[2]/div/div[1]/div[3]/button').click()

Once you are logged in, it will be possible to get the full expected page. The Selenium documentation makes it easy to find out how. Let me know if you have further questions.

Is there any way to do the same thing without bringing up the browser? Could I make it run in the background somehow? — John Doe, Dec 28 '16 at 04:02
You don't need to do it some other way. You can hide the browser if you want. http://stackoverflow.com/questions/16180428/can-selenium-webdriver-open-browser-windows-silently-in-background — titusAdam, Dec 28 '16 at 08:30
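
To tie the two comments above to the question, here is a rough sketch of running Firefox without a window and then following the "Private Reports" link. It assumes Selenium 4 with a local Firefox install, that the driver has already been logged in as in the answer, and that the link text matches the HTML in the question; `make_headless_firefox` and `open_private_reports` are hypothetical helper names, not part of Selenium.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (assumes Selenium 4 and Firefox)."""
    # Imported here so the helper below can be used without importing Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox command-line flag for headless mode
    return webdriver.Firefox(options=options)

def open_private_reports(driver):
    """Click the 'Private Reports' link and return the HTML of the resulting page.

    Assumes `driver` is already logged in to the site.
    """
    # "link text" is the locator strategy that By.LINK_TEXT stands for.
    driver.find_element("link text", "Private Reports").click()
    # A real script may need an explicit wait here before reading the page.
    return driver.page_source
```

Typical use would be `driver = make_headless_firefox()`, then the login steps from the answer, then `html = open_private_reports(driver)` followed by `driver.quit()`.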
