19

I am downloading HTML pages that have data defined in them in the following way:

... <script type= "text/javascript">    window.blog.data = {"activity":{"type":"read"}}; </script> ...

I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)

Thanks

Edit: Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?

user971956
  • If you can get to the point where you can split on the `=`, you can parse the JSON into a Python object with `json.loads`: `x = '{"a":{"b": "c"}}'` is a `str`, and `y = json.loads(x)` gives the dict `{u'a': {u'b': u'c'}}`. – Pratik Mandrekar Nov 10 '12 at 17:22
  • The problem with the parsing is finding the end point, because I am not sure the `</script>` would come right after the JSON closer. – user971956 Nov 10 '12 at 19:03
  • How robust of a solution are you looking for? A relatively simple (though somewhat computationally taxing) approach would be to load up a Selenium driver, which will handle all the parsing for you, and have it return the variable's value. – cheeken Nov 10 '12 at 19:46
  • @cheeken Does the Selenium driver have a local library? I'd rather not depend on an online API... – user971956 Nov 10 '12 at 20:53

4 Answers

16

BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
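To illustrate that caveat, here is a minimal sketch using only the standard library. The two literals and the helper name `parses_as_json` are made up for illustration. Both literals are valid JavaScript object literals, but only the first is also valid JSON:

```python
import json

# Hypothetical literals, for illustration only.
valid_json = '{"activity": {"type": "read"}}'
js_only = "{activity: {type: 'read'}}"  # unquoted keys, single-quoted string


def parses_as_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(parses_as_json(valid_json))
print(parses_as_json(js_only))
```

If the pages you scrape use the second style, the `json` module alone is not enough and a JavaScript parser is needed.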

In simple cases you could:

  1. extract the <script>'s text using an HTML parser
  2. assume that the window.blog... assignment fits on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
  3. assume that the extracted string is valid JSON and parse it using the json module

Example:

#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
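soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
# find the <script> whose text mentions the variable (raw string for the regex)
script = soup.find('script', string=re.compile(r'window\.blog\.data'))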
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'

If the assumptions are incorrect then the code fails.
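If the literal is valid JSON, one possible way to avoid guessing where it ends is `json.JSONDecoder.raw_decode`, which parses a single JSON value starting at a given index and ignores whatever follows. The sketch below is only an illustration: the helper name `extract_json_after` and the sample script text (with a ';' inside a string value) are assumptions, not part of the original page.

```python
import json
import re


def extract_json_after(marker_regex, text):
    """Parse the first JSON value that directly follows marker_regex in text."""
    match = re.search(marker_regex, text)
    if match is None:
        raise ValueError("marker not found")
    # raw_decode parses one JSON value starting at the given index and
    # returns (value, end_index); text after the value is not inspected.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


# Hypothetical script text: note the ';' inside a string and the trailing code.
script_text = ('window.blog.data = {"note": "a; b", '
               '"activity": {"type": "read"}}; var x = 1;')
data = extract_json_after(r'window\.blog\.data\s*=\s*', script_text)
print(data['activity']['type'])
```

The marker regex has to consume any whitespace after the `=`, because `raw_decode` expects the JSON value to start exactly at the index it is given.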

To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):

import json
from bs4 import BeautifulSoup
from slimit import ast  # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor

soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
           if (isinstance(node, ast.Assign) and
               node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'

There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like any other AST node. That would make it possible to support arbitrary JavaScript code (as long as slimit can parse it).

Community
jfs
  • thank you, this solved a problem that had me stumped - isolating json inside a script in a html page. I replaced 'window\.blog\.data' with my term that preceded the json, but the json_text assignment on the next line threw an error, saying object wouldn't take string functions (script.string). So I modified script = str(soup.find(...)) and the whole script was one long string. Then it was easy to use find to get the position of the phrases before and after the json, slice it out, and read it as a json. – Brian L Cartwright Jul 27 '14 at 20:38
  • @BrianLCartwright: the code works as is. You don't need to change it. It should work without `str(soup.find..`. If you want to find out what happens in your case [ask]. – jfs Jul 27 '14 at 20:52
  • Hey @J.F. Sebastian, What if I have ``? Here the case is not to get an object defined in some variable but to get an object used as an argument inside script tags. Note, case like this can be found where the websites use the javascript to load the web contents via json dynamically during page load. – user79307 Nov 03 '14 at 09:03
  • @user79307: It is enough to change the regex that extracts `json_text` above, if you don't know how, [ask a new question](http://stackoverflow.com/questions/ask) – jfs Dec 01 '14 at 17:30
  • I used an alternate more readable `json_text = script.text.split('=',1)[1].rstrip(';').strip()` line to split based on first `=` and remove last semicolon. I also had to remove some html apostrophes with semicolons in them `json_text = json_text.replace('&#39;','')` before sending to json – user1071182 Mar 09 '16 at 22:31
  • @user1071182: 1- `split('=')` may split on the wrong `=` (there could be multiple `=` signs in the script) 2- `BeautifulSoup` should already decode (where appropriate) html entities such as `&#39;`. If it doesn't then you shouldn't do it blindly: it depends on your application whether you should decode or not (`data` can be arbitrary json text). 3- if you think you need to decode the html entities then [use `import html; s = html.unescape('&#39;')` instead](http://stackoverflow.com/q/2087370/4279) – jfs Mar 09 '16 at 22:47
  • @JF Sebastian I used `split('=',1)` to only split once on the first `=`. I'm assuming the data is in the format `window.blog.data = ` and we've already searched for the variable name. I ended up using `json_text = re.sub(r'\d+;','',json_text)` to eliminate all the special chars, but it's good to know the proper way to do it :) – user1071182 Mar 10 '16 at 07:31
  • @user1071182: 1- it is incorrect to assume that the first `=` sign in the script is the correct one. 2- again, `script.string` is JavaScript code and therefore you shouldn't need to remove html entities. Make sure you're not corrupting your json data. – jfs Mar 10 '16 at 08:59
  • @user1071182: I've updated the answer to use a javascript parser, to extract the javascript object literal. It makes the regex hacks unnecessary. – jfs Mar 10 '16 at 15:44
  • @JF Sebastion you're 100% right. It's much easier to see that when it's not 3am :) I'll update my dryscrape answer at some point to correct these errors – user1071182 Mar 10 '16 at 23:01
7

Something like this may work:

import re

HTML = """ 
<html>
    <head>
    ...
    <script type= "text/javascript"> 
window.blog.data = {"activity":
    {"type":"read"}
    };
    ...
    </script> 
    </head>
    <body>
    ...
    </body>
    </html>
"""

# escape the dots so they match literally; DOTALL lets the object span lines
JSON = re.compile(r'window\.blog\.data = ({.*?});', re.DOTALL)

matches = JSON.search(HTML)

print(matches.group(1))
Christian Thieme
1

I had a similar issue and ended up using Selenium with PhantomJS. It's a little hacky, and I couldn't quite figure out the correct wait-until method, but the implicit wait has worked fine for me so far.

from selenium import webdriver
import json
import re

url = "http..."
driver = webdriver.PhantomJS(service_args=['--load-images=no'])
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(1)
script_text = re.search(r'window\.blog\.data\s*=.*<\/script>', driver.page_source).group(0)

# take everything after the first equals sign
json_text = script_text.split('=', 1)[1].strip()
# only care about the first piece of JSON: cut at the first "};" and
# restore the closing brace (this also drops the trailing </script> tag)
json_text = json_text.split("};")[0] + "}"
data = json.loads(json_text)

driver.quit()


user1071182
-1

A fast and easy way is a regex of the form ('put the exact start here (.*?) and the end here'). That's all!

import re
import json
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""

Then simply

re.search('{"activity":{"type":"(.*?)"', html).group(1)

or, for the full JSON

# escape the dots so they match literally
jsondata = re.search(r'window\.blog\.data = (.*?);', html).group(1)
jsondata = json.loads(jsondata)
print(jsondata["activity"])

# output: {'type': 'read'}

Amine Rizk
  • This solution is already covered in [this existing answer from 2012](https://stackoverflow.com/a/13325429/5320906). When answering old questions, please ensure that your answer provides a distinct and valuable contribution to the Q&A. – snakecharmerb Aug 17 '21 at 05:49