BS4 parses HTML, so you can use it to find the contents of this script
tag. For example, if this is the only script in you page:
script = soup.script.text
But it doesn't parse JavaScript.
So, you have a few choices:
- Download (or write) a JavaScript interpreter, use it to execute the code in the script, and then inspect it to see the variable and value it placed into the JS globals.
- Download (or write) a JavaScript parser, then scan the nodes for the
var
statement you're looking for and extract and interpret its value.
- Write a parser for a limited subset of JavaScript that will handle this specific case, but raise a noisy exception if they later rewrite their page to do something completely different in the
script
tag.
Which one you want to do depends on what you're trying to accomplish. But I suspect the last one is what you actually want here. In which case there isn't going to be anything off-the-shelf that does all the work for you. But you can cheat.
You don't actually need the whole TralbumData
value, just the album_release_date
member of it. So the grammar can be as simple as this regex:
r'album_release_date: \"(.*?)\"'
So:
script = soup.script.text
reldatematch = re.search(r'album_release_date: \"(.*?)\"', script)
if reldatematch:
reldate = your_date_parser_func(reldatematch.group(1))
Whether you want to make this more robust is up to you.
If, say, you want to verify that this is actually the TralbumData.album_release_date
value, not just something that happens to match album_release_date
, then the grammar you want is just var TralbumData = OBJECT_LITERAL
, and that OBJECT_LITERAL
is almost JSON, except that it has bare keys. The first part, you could handle with just string methods:
empty, lead, literal = script.partition('var TralbumData = ')
if empty or not lead:
raise SomeException
And for parsing the literal
, you could adapt a JSON parser like the JSON example for pyparsing
or the stdlib's json
module. Or, alternatively, you could do something hacky like pre-quoting all the keys and then just json.loads
it.