Web scraping unknown data structure (JSON, nested list, or something else?)

Question

I built a web scraper for this page that hinged on parsing a string as JSON. But they've updated the site, and now the scraper has stopped working. I think the issue is that the information I need is no longer structured as JSON.

Here's what I had originally:

# Packages
from bs4 import BeautifulSoup
from urllib.request import urlopen, urlretrieve
import json
import ast

# The part that still works
address = 'https://campus.datacamp.com/courses/intro-to-python-for-data-science/chapter-1-python-basics?ex=2' 
html = urlopen(address)
soup = BeautifulSoup(html, 'lxml')
string = soup.find_all('script')[2].string
json_text = string.strip('window.PRELOADED_STATE = "')[:-2]

# The part that's now broken
lesson = json.loads(json_text)

#> Traceback (most recent call last):
#> <ipython-input-11-f9b7d249d994> in <module>()
#>       2 # The part that's now broken
#>       3 
#> ----> 4 lesson = json.loads(json_text)
#> ~/anaconda3/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
#>     352             parse_int is None and parse_float is None and
#>     353             parse_constant is None and object_pairs_hook is None and not kw):
#> --> 354         return _default_decoder.decode(s)
#>     355     if cls is None:
#>     356         cls = JSONDecoder
#> ~/anaconda3/lib/python3.6/json/decoder.py in decode(self, s, _w)
#>     337 
#>     338         """
#> --> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
#>     340         end = _w(s, end).end()
#>     341         if end != len(s):
#> ~/anaconda3/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
#>     355             obj, end = self.scan_once(s, idx)
#>     356         except StopIteration as err:
#> --> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
#>     358         return obj, end
#> JSONDecodeError: Expecting value: line 1 column 2 (char 1)

The issue is that the information in json_text is no longer structured as JSON.

need_to_parse = BeautifulSoup(json_text, 'lxml').string #Escape HTML
print(len(need_to_parse))
#> 61453
print(need_to_parse[:50])
#> ["~#iM",["preFetchedData",["^0",["course",["^0",["
print(need_to_parse[-50:])
#> "type","MultipleChoiceExercise","id",14253]]]]]]]]

I thought maybe it was a nested list, so I tried ast.literal_eval(), but no luck!

parsed_list = ast.literal_eval(need_to_parse)
#> Traceback (most recent call last):
#>   File "/Users/nicholascifuentes-goodbody/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
#>     exec(code_obj, self.user_global_ns, self.user_ns)
#>   File "<ipython-input-13-55b60da762d6>", line 2, in <module>
#>     parsed_list = ast.literal_eval(need_to_parse)
#>   File "/Users/nicholascifuentes-goodbody/anaconda3/lib/python3.6/ast.py", line 48, in literal_eval
#>     node_or_string = parse(node_or_string, mode='eval')
#>   File "/Users/nicholascifuentes-goodbody/anaconda3/lib/python3.6/ast.py", line 35, in parse
#>     return compile(source, filename, mode, PyCF_ONLY_AST)
#>   File "<unknown>", line 1
#>     ["~#iM",["preFetchedData"

The full output is in a txt file HERE.

Does anyone recognize this data structure? What's the best way to parse it?

Created on 2018-10-19 by the reprexpy package

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-10-19
#> Packages ------------------------------------------------------------------------
#> beautifulsoup4==4.6.0
#> reprexpy==0.1.1

Can you make the complete string available? (See https://meta.stackexchange.com/questions/47689/how-can-i-attach-a-file-to-a-post for suggestions on how to do it. Don't paste 60000 chars here.) I'm curious as to why ast doesn't work. — Alain, Oct 19 '18 at 14:30
Thanks for the suggestion @Alain. I edited the question and added a DropBox link to the full string. Appreciate your help! — ncgoodbody, Oct 19 '18 at 16:10
If you're only interested in parsing THIS page, the issue is with double quotes that are escaped. Removing them allows you to load the string as json and access all the lists and inner lists. Executing `json_text = json_text.replace('\\\\\"', '')` will do it for you. This is certainly not a final solution as next week the page may contain other escaped characters, but this is a good starting point for you to understand what is happening and experiment with different solutions. — Alain, Oct 21 '18 at 17:48
Ah ha! Thanks for looking through the string. I'm trying what you're suggesting, but I can't get it to work. So you're doing `json_text = json_text.replace('\\\\\"', '')` and then `ast.literal_eval(json_text)` or `json.loads(json_text)`? — ncgoodbody, Oct 21 '18 at 22:59
I'm using `json.loads(json_text)`. In my test code I'm reading the original string from the file that you posted, so maybe the string you work with looks different before the writing/reading process. The first problematic sequence is around `not_printed_msg = \\"__JINJA__:Have you` and this is the first occurrence of JINJA. You can look for this section and verify that there are 2 backslashes before the double quote. — Alain, Oct 22 '18 at 12:35
Yes, it worked. Thank you! I'm curious, how did you figure out it was the escaped double quotes? I couldn't make heads or tails of the string. — ncgoodbody, Oct 23 '18 at 02:42
You're welcome. When calling `ast.literal_eval()` on the original string, the error message contained two lines: the original string, then a line with a caret (^) under the first blank following the first word after the first `\\n` occurrence. — Alain, Oct 23 '18 at 10:12
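
The fix discussed in the comments above can be sketched on a miniature, made-up payload. The names `sample` and `strip_double_escaped_quotes` are hypothetical and not part of the original scraper. The real string is far longer and may contain other escape sequences that this does not handle.

```python
import json


def strip_double_escaped_quotes(text):
    """Drop every backslash-backslash-quote sequence, as suggested in the comments."""
    return text.replace('\\\\"', '')


# Hypothetical miniature of the scraped payload: two backslashes precede each inner quote.
sample = '["~#iM",["msg","not_printed_msg = \\\\"__JINJA__:Have you\\\\""]]'

try:
    json.loads(sample)
except json.JSONDecodeError as err:
    # err.pos is the index at which the decoder gave up.
    print("failed near:", sample[max(0, err.pos - 20):err.pos + 20])

cleaned = strip_double_escaped_quotes(sample)
parsed = json.loads(cleaned)
print(parsed)
```

Note that this removes the inner quotes altogether, so the text inside the affected strings changes slightly. That is acceptable for exploring the structure, but probably not for anything that needs the strings verbatim.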

score 1 · Answer 1 · answered Oct 20 '18 at 07:42

The data structure is a JavaScript array (of nested arrays), serialised to a string with its HTML entities escaped.

In your browser console, you can unescape it and call eval on the unescaped string to get the array.
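
If you would rather do the unescaping step in Python, the standard library's `html.unescape` converts HTML entities back to characters. This is a minimal sketch: the `escaped` string is a made-up fragment, and it assumes the quotes in the page source appear as `&quot;` entities.

```python
import html
import json

# Hypothetical fragment with HTML entities standing in for the double quotes.
escaped = '[&quot;~#iM&quot;,[&quot;preFetchedData&quot;,[&quot;^0&quot;,[]]]]'

unescaped = html.unescape(escaped)
print(unescaped)

# This small fragment happens to be valid JSON once unescaped;
# the full page payload may not be.
data = json.loads(unescaped)
print(data[0])
```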

For me, ast.literal_eval raises SyntaxError, so the string must contain JavaScript elements that are not valid Python syntax. Even if it didn't, ast.literal_eval could still fail on JavaScript elements that are syntactically valid Python but not legal literal values, for example null or objects with unquoted keys.
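
The second failure mode can be seen on a tiny example. The `js_like` string is made up for illustration. It parses as Python syntax, but `null` is a bare name rather than a literal, so `ast.literal_eval` rejects it, while `json.loads` maps it to `None`.

```python
import ast
import json

# Hypothetical JavaScript-style array containing null.
js_like = '[1, null, "a"]'

try:
    ast.literal_eval(js_like)
except ValueError as err:
    # Valid Python syntax, but `null` is not a Python literal.
    literal_eval_error = err
    print("literal_eval rejected it:", type(err).__name__)

# The JSON decoder understands null and returns None in its place.
print(json.loads(js_like))
```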

To parse it, you need to shell out to a JavaScript parser or find a Python tool that parses JavaScript. The answers to this question list some, but note that it has been closed since 2014, so there may be newer solutions available.
