
I'm trying to scrape a lot of data from the page below. When I inspect it in a browser I can see a path to each element, but BeautifulSoup can't reach that data. For example, I'm after the city, Beijing, using the lookup shown below, but I get None. When I print the soup, the HTML is structured very differently: the data appears inside JavaScript, which BeautifulSoup doesn't parse. What alternative would let me pull data from that section? Thanks.

from bs4 import BeautifulSoup,Tag
import urllib2
hdr = {'Accept': 'text/html,application/xhtml+xml,*/*',"user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"}
url='https://www.upwork.com/freelancers/_~013dfabae39ba01678/'
req=urllib2.Request(url,headers=hdr)
html = urllib2.urlopen(req)
soup=BeautifulSoup(html,"lxml")
# Inspecting the page in a browser shows this span, but the printed soup holds the data inside a text/javascript block instead
locality=soup.find('span',{'itemprop':'locality'})

In the middle of the BeautifulSoup output, all the data of interest appears in this snippet, in var phpVars:

<script type="text/javascript">
    // global Applet object
    var Applet = new function() {
        var basePath = '/freelancers';
        var phpVars = {"urchinId":"UA-62227314-1","csrfTokenCookieName":"XSRF-TOKEN","csrfTokenHeaderName":"X-Odesk-Csrf-Token","runtime_id":"0128305700c7dc55bb8","clientStatsDMetrics":true,"smfAjax":false,"userId":"424358860525125632","isVisitor":true,
FancyDolphin

1 Answer


You can try this:

from bs4 import BeautifulSoup
import urllib2
import re
import json

hdr = {'Accept': 'text/html,application/xhtml+xml,*/*',"user-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"}
url='https://www.upwork.com/freelancers/_~013dfabae39ba01678/'
req=urllib2.Request(url,headers=hdr)
html = urllib2.urlopen(req)
soup=BeautifulSoup(html,"lxml")

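# The first <script> after <title> holds the phpVars assignment
script = soup.title.find_next('script').get_text()
# Capture the object literal assigned to phpVars; this assumes it sits on a single line
map_search = re.search(r'var phpVars = (\{.*);', script)
mapData = map_search.group(1)
mapDataObj = json.loads(mapData)

print mapDataObj['profile']['profile']['location']['city']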

It finds the first script element after the title and extracts its text content.

The data you're interested in is in JSON format, so we extract the JSON part of the script with a regular expression and parse it with Python's json module.

You can then access the data through a dict named mapDataObj.
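If you want to try the extraction step without fetching the page, here is a minimal offline sketch. It is not the code above: instead of a greedy regex it uses `json.JSONDecoder().raw_decode`, which parses a single JSON value and ignores whatever follows it. `SAMPLE_SCRIPT` is a made-up, heavily shortened stand-in for the real script, `extract_js_object` is a hypothetical helper name, and the `other` variable is invented only to show that trailing content is ignored. It assumes the value assigned to the variable is valid JSON.

```python
import json
import re

# Hypothetical, shortened stand-in for the page's inline script.
# The real phpVars object is much larger.
SAMPLE_SCRIPT = """
    // global Applet object
    var Applet = new function() {
        var basePath = '/freelancers';
        var phpVars = {"userId":"424358860525125632","isVisitor":true,"profile":{"profile":{"location":{"city":"Beijing"}}}};
        var other = {"unrelated": 1};
    };
"""


def extract_js_object(script_text, var_name):
    """Return the JSON value assigned to `var_name`, or None if it is absent."""
    # Match "var <name> = " including surrounding whitespace, so that
    # match.end() points at the first character of the JSON value.
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # returns (value, end_index); the trailing ';' and later code are ignored.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


php_vars = extract_js_object(SAMPLE_SCRIPT, "phpVars")
city = php_vars["profile"]["profile"]["location"]["city"]
print(city)
```

With the real page, you would pass the text returned by `get_text()` on the script element in place of `SAMPLE_SCRIPT`. Checking for `None` before indexing is worth doing, since the variable name or page layout may change.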

SLePort
  • ah silly me!!!!!! I didn't think of loading it as a json!!! Thank you. Just for completeness to fully answer the question I gave, the line should've been print mapDataObj['profile']['profile']['location']['city'] – FancyDolphin Apr 03 '16 at 08:49
  • Updated my answer. – SLePort Apr 03 '16 at 08:50