I'm trying to extract CSU employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried both urllib2 and the requests library, but neither returned the actual table from the webpage. My guess is that the table is generated dynamically by JavaScript. Below is my code using requests.

from lxml import html
import requests

page = requests.get("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")
tree = html.fromstring(page.text)
name = tree.xpath('//table/tbody/tr/td[2]/text()')

Any help/comments will be highly appreciated.

WGS
jinlong
  • If you inspect the page, the info's actually a JSON file. I hope you know what this means. :D – WGS Apr 08 '14 at 22:25
  • Thanks Nanashi! I know how to handle json files, but could you point me to the json file? I was not able to find the url to the json file within the webpage. – jinlong Apr 08 '14 at 22:43
  • The URL is `http://api.sacbeelabs.com/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json`. However, you need to make a POST request for this, because otherwise it'll just return the following in Python: `{u'status': {u'message': u'Unauthorized', u'code': 401, u'reason': u'Client origin not specified'}, u'request': {u'verb': u'statepay/employee/search/name=/year=2013/department=CSU%20Sacramento', u'params': [], u'format': u'json'}}`. – WGS Apr 08 '14 at 22:48
  • Hi Nanashi, how did you find the json file? – user3314418 Apr 21 '14 at 18:00
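
The 401 body quoted in the comments suggests the API wraps its responses in a `status` object. Here is a minimal sketch of checking that envelope before touching the data. It assumes (a guess from that one response, not documented behavior) that failures carry a numeric `status.code` of 400 or higher and that successful payloads keep their data under `result`. `ApiError` and `result_or_raise` are hypothetical names, not part of any library.

```python
class ApiError(Exception):
    """Raised when the response envelope reports a failure (hypothetical helper)."""


def result_or_raise(payload):
    # Assumption: errors look like the 401 body quoted in the comments,
    # i.e. {"status": {"code": 401, "message": ..., "reason": ...}, ...}.
    status = payload.get("status", {})
    code = status.get("code")
    if isinstance(code, int) and code >= 400:
        raise ApiError("{} {}: {}".format(
            code, status.get("message"), status.get("reason")))
    # Assumption: successful responses carry their data under "result".
    return payload.get("result")


# The error body from the comment above, as a plain dict.
unauthorized = {
    "status": {
        "message": "Unauthorized",
        "code": 401,
        "reason": "Client origin not specified",
    },
    "request": {
        "verb": "statepay/employee/search/name=/year=2013/department=CSU%20Sacramento",
        "params": [],
        "format": "json",
    },
}

try:
    result_or_raise(unauthorized)
except ApiError as exc:
    print("API refused the request:", exc)
```

Calling this on `r.json()` right after the request would turn a silent "Unauthorized" body into an explicit exception instead of a `KeyError` later on.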

2 Answers

I just took a quick look at the website you mentioned. It is indeed because the table is loaded in using JavaScript, so it is not actually part of the page you are requesting in your script.

To fix this, you'll probably have to look into the web requests made by the page (for example in your browser's network inspector) and find the one that retrieves the table data. It is not hard to do, just a nuisance. Take a look here; similar question. Hope it helps!
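
One reason the script never sees the table: everything after `#` in a URL is a fragment, which the client keeps to itself and does not send to the server, so `requests.get` only ever fetches the bare `/statepay/` page. The sketch below (Python 3) splits the question's URL to show that, then rebuilds the data URL from the fragment. The mapping from fragment to API URL is an assumption inferred by comparing the question's URL with the API URL quoted in the comments; `API_BASE` and both function names are illustrative.

```python
from urllib.parse import quote, unquote, urlsplit

PAGE_URL = ("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D"
            "%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")

# Assumption: base path taken from the API URL quoted in the comments.
API_BASE = "http://api.sacbeelabs.com/v1/statepay/"


def split_page_url(url):
    """Return (the part the server actually receives, the fragment)."""
    parts = urlsplit(url)
    sent = parts.scheme + "://" + parts.netloc + parts.path
    return sent, parts.fragment


def api_url_from_fragment(fragment):
    """Guess the JSON endpoint from a 'req=<percent-encoded path>' fragment."""
    key, _, value = fragment.partition("=")
    if key != "req":
        raise ValueError("unexpected fragment: " + fragment)
    path = unquote(value)  # e.g. 'employee/search/name=/year=2013/...'
    # Re-encode spaces but keep '/' and '=' readable, as in the quoted URL.
    return API_BASE + quote(path, safe="/=") + ".json"


sent, fragment = split_page_url(PAGE_URL)
print("Server sees:", sent)
print("Likely data URL:", api_url_from_fragment(fragment))
```

This only derives a candidate URL. Whether the endpoint answers, and which headers it wants, still has to be checked against the requests the browser actually makes.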

Community
Erwin
Here's my attempt at it, as per my comment. Note that I only pulled out one row of data (the first employee). The rest is up to you.

Code:

import requests as rq

url = "http://api.sacbeelabs.com/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json"
data = "74XoegZ494trsvrus_As4B4handjZ494-Adl4B4olg494dnnk933pppAmWYXaaAYjh3mnWnakWq3-Ela-B-Oahkgjqaa07tw8tJmaWlYd07tw8tJiWha07tw8uH07tw8tJqaWl07tw8uHtrsu07tw8tJZakWlnhain07tw8uHGT-107tw8trTWYlWhainj4B4labalal494dnnk933mnWYfj-8albgjpAYjh3-Boamnejim3tt_v_rt_3YlWpgeic1nWXgam1bljh1paXkWca4B4nenga494TnWnaDVjlfalDTWgWlqDTaWlYdD1DUdaDTWYlWhainjDFaaBDTWYlWhainjBDGWgebjlieW4B4mYlV49sxzrB4mYlL49srwrB4peiV49sxzrB4peiL49_stB4oW4974Wcain494Oj-CeggW3wArD-I-6ss-MD-1Xoino-MDNeio-AD-Azx2xv-MDl-89tzAr-JDKaYfj3trsrrsrsDJelabj-A3tzAr4B4njoYd49bWgmaB4Zjh4954mnjlWca4B4WiehWneji4B4YWi-8WmtZ4B4paXmjYfan4B4pjlfal4B4WoZej4B4-8eZaj4B4m-8c4B4cajgjY46B4Ymm4954WiehWneji4B4nlWimbjlh468B4omal4974Woi494Koamn488"
headers = {
'Host': 'api.sacbeelabs.com',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'X-SBAPI-Auth-Token': '0QNWbefXw6fQQcWXqK8vDw',
'X-SBAPI-SID': '3gbRqglHXAVDy1vwdcVVMf',
'X-SBAPI-CID': '2HuWho39ZcDUlTswYSWUd9',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Referer': 'http://www.sacbee.com/statepay/',
'Content-Length': '684',
'Origin': 'http://www.sacbee.com',
'Cookie': 'sbapi-cid=2HuWho39ZcDUlTswYSWUd9; sbapi-sid=3gbRqglHXAVDy1vwdcVVMf',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}

r = rq.post(url, data=data, headers=headers)
json_data = r.json()

base = json_data["result"]["employees"][0] # First employee.

name = base["name"]
first_name = name["first"]
last_name = name["last"]

pay = base["pay"]["total"]

title = base["title"]
dept = base["department"]

print("{} {} {} {} {}".format(first_name, last_name, pay, title, dept))  # Works on Python 2.7 and 3.
# Your turn here...

Result:

Clayton Abajian 9844 Lecturer - Academic Year CSU Sacramento
[Finished in 0.9s]
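
For the "your turn" part, here is a minimal sketch (Python 3) of looping over every employee instead of just the first and flattening the records to CSV. It assumes each entry has the same shape as the one read above (`name.first`, `name.last`, `pay.total`, `title`, `department`); other fields the API may return are ignored. `employee_rows`, `rows_to_csv` and the column names are made up for illustration, and the sample dict stands in for `r.json()` using only the single row shown in the result.

```python
import csv
import io

FIELDS = ["first_name", "last_name", "pay", "title", "department"]


def employee_rows(json_data):
    """Flatten each employee record into a dict with one key per column."""
    rows = []
    for emp in json_data["result"]["employees"]:
        rows.append({
            "first_name": emp["name"]["first"],
            "last_name": emp["name"]["last"],
            "pay": emp["pay"]["total"],
            "title": emp["title"],
            "department": emp["department"],
        })
    return rows


def rows_to_csv(rows):
    """Serialize the flattened rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Stand-in for r.json(), built from the one row printed in the result.
sample = {
    "result": {
        "employees": [
            {
                "name": {"first": "Clayton", "last": "Abajian"},
                "pay": {"total": 9844},
                "title": "Lecturer - Academic Year",
                "department": "CSU Sacramento",
            }
        ]
    }
}

rows = employee_rows(sample)
print(rows_to_csv(rows))
```

With the real response, `employee_rows(r.json())` would replace the `sample` call, and the CSV text can be written to a file as-is.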
WGS
  • Many thanks! May I know what 'data' is in your code? – jinlong Apr 08 '14 at 23:08
  • Kindly mark the answer as accepted if it helped you. Anyway, data is simply the POST parameter I got from inspecting via Firebug. – WGS Apr 08 '14 at 23:34
  • Thanks a lot, Nanashi! It worked after some poking around. I couldn't figure out which parameter you used for "data", so I skipped the data parameter in rq.post, and it still works. – jinlong Apr 09 '14 at 06:45