I need to extract the "album_release_date" value from a web page. It appears inside a script tag, in what looks like a dictionary, and I am not sure whether it can be accessed with bs4. The only solution I can think of is a regex. Would that be the best method?

Here is the structure of the page's source code:

<script type="text/javascript">

var TralbumData = {

    current: {
        "upc":null,"title":"BLACK&WHITE MEDICINE","purchase_title":null,
        "download_desc_id":null,"minimum_price":0.0,"set_price":7.0,"mod_date":"17 Jun 2018 11:47:50 GMT"
    },
    album_is_preorder: null,
    album_release_date: "17 Jun 2018 00:00:00 GMT",
}
</script>

I have searched all over, but could not find anything other than how to access the dictionaries that bs4 itself exposes.

Aran-Fey
jakejake
  • Use the regex `current: {(.|\s)+?}` to get the `current` object as a string and then [parse it as JSON](https://stackoverflow.com/questions/7771011/parse-json-in-python) – GalAbra Jun 18 '18 at 17:37
  • @GalAbra `album_release_date` isn't a subkey of `current`, it's a top-level key. – abarnert Jun 18 '18 at 17:48

2 Answers

BS4 parses HTML, so you can use it to find the contents of this script tag. For example, if this is the only script in your page:

script = soup.script.text

But it doesn't parse JavaScript.

So, you have a few choices:

  • Download (or write) a JavaScript interpreter, use it to execute the code in the script, and then inspect it to see the variable and value it placed into the JS globals.
  • Download (or write) a JavaScript parser, then scan the nodes for the var statement you're looking for and extract and interpret its value.
  • Write a parser for a limited subset of JavaScript that will handle this specific case, but raise a noisy exception if they later rewrite their page to do something completely different in the script tag.

Which one you want to do depends on what you're trying to accomplish. But I suspect the last one is what you actually want here. In which case there isn't going to be anything off-the-shelf that does all the work for you. But you can cheat.

You don't actually need the whole TralbumData value, just the album_release_date member of it. So the grammar can be as simple as this regex:

r'album_release_date: \"(.*?)\"'

So:

script = soup.script.text
reldatematch = re.search(r'album_release_date: \"(.*?)\"', script)
if reldatematch:
    reldate = your_date_parser_func(reldatematch.group(1))

Whether you want to make this more robust is up to you.
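
As one possible way to fill in the your_date_parser_func placeholder above, here is a minimal sketch that pairs the regex with datetime.strptime and fails loudly when the key is missing. The helper name extract_release_date is hypothetical, the script text is hard-coded from the question rather than taken from soup.script.text, and the format string assumes the site always uses the "DD Mon YYYY HH:MM:SS GMT" layout with English month names.

```python
import re
from datetime import datetime, timezone

# Sample script text based on the question; in real use this would come
# from soup.script.text.
script = '''
var TralbumData = {
    album_is_preorder: null,
    album_release_date: "17 Jun 2018 00:00:00 GMT",
}
'''


def extract_release_date(script_text):
    """Hypothetical helper: return album_release_date as an aware UTC datetime."""
    match = re.search(r'album_release_date: "(.*?)"', script_text)
    if match is None:
        # Noisy failure, in case the page is later rewritten.
        raise ValueError('album_release_date not found; has the page changed?')
    # Assumption: fixed layout with a literal "GMT" suffix and English month names.
    parsed = datetime.strptime(match.group(1), '%d %b %Y %H:%M:%S GMT')
    # strptime returns a naive datetime here, so attach UTC explicitly.
    return parsed.replace(tzinfo=timezone.utc)


reldate = extract_release_date(script)
print(reldate.isoformat())  # 2018-06-17T00:00:00+00:00
```

Matching the literal "GMT" rather than using %Z keeps the parse strict: a value in any other zone raises ValueError instead of being silently misread.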

If, say, you want to verify that this is actually the TralbumData.album_release_date value, not just something that happens to match album_release_date, then the grammar you want is just var TralbumData = OBJECT_LITERAL. That OBJECT_LITERAL is almost JSON, except that it has bare keys (and, in this sample, a trailing comma before the closing brace, which JSON also rejects). The first part, you could handle with just string methods:

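# Strip first: the script text starts with blank lines before the var statement.
empty, lead, literal = script.strip().partition('var TralbumData = ')
if empty or not lead:
    raise ValueError('script does not start with "var TralbumData = "')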

And for parsing the literal, you could adapt a JSON parser like the JSON example for pyparsing or the stdlib's json module. Or, alternatively, you could do something hacky like pre-quoting all the keys and then just json.loads it.
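
The hacky pre-quoting route could look roughly like the sketch below. It is tailored to the sample in the question, not a general JavaScript parser: it assumes every bare key sits at the start of its own line, that no string value contains a comma directly before a closing brace or bracket, and that the script holds nothing but the one var statement. The helper name parse_tralbum_data is hypothetical. Besides quoting keys, it also drops the trailing comma before the final brace, which json.loads would otherwise reject.

```python
import json
import re

# Script text as it appears in the question.
script = '''
var TralbumData = {

    current: {
        "upc":null,"title":"BLACK&WHITE MEDICINE","purchase_title":null,
        "download_desc_id":null,"minimum_price":0.0,"set_price":7.0,"mod_date":"17 Jun 2018 11:47:50 GMT"
    },
    album_is_preorder: null,
    album_release_date: "17 Jun 2018 00:00:00 GMT",
}
'''


def parse_tralbum_data(script_text):
    """Hypothetical helper: turn the TralbumData literal into a Python dict."""
    empty, lead, literal = script_text.strip().partition('var TralbumData = ')
    if empty or not lead:
        raise ValueError('script does not start with "var TralbumData = "')
    # Quote bare keys: an identifier at the start of a line, followed by a colon.
    quoted = re.sub(r'^([ \t]*)([A-Za-z_$][\w$]*)\s*:', r'\1"\2":',
                    literal, flags=re.MULTILINE)
    # Remove trailing commas that sit directly before a closing brace or bracket.
    cleaned = re.sub(r',(\s*[}\]])', r'\1', quoted)
    # Drop an optional terminating semicolon.
    cleaned = cleaned.rstrip().rstrip(';')
    # json.loads raises json.JSONDecodeError (a ValueError) if the page changes shape.
    return json.loads(cleaned)


data = parse_tralbum_data(script)
print(data['album_release_date'])
```

Because the whole literal is parsed, album_release_date is read as a top-level key of TralbumData rather than from whatever text happens to match a regex.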

abarnert
  • that makes a lot of sense, thanks for listing the options, really helpful for me. I will probably go with the regex option then. – jakejake Jun 18 '18 at 17:57

Extract the script tag's contents with BeautifulSoup, then use a regular expression to pull out the album_release_date value:

from bs4 import BeautifulSoup
import re


html = '''<script type="text/javascript">

        var TralbumData = {

            current: {
                "upc":null,"title":"BLACK&WHITE MEDICINE","purchase_title":null,
                "download_desc_id":null,"minimum_price":0.0,"set_price":7.0,"mod_date":"17 Jun 2018 11:47:50 GMT"
            },
            album_is_preorder: null,
            album_release_date: "17 Jun 2018 00:00:00 GMT",
        }
        </script>'''



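# Capture the quoted value that follows the bare album_release_date key.
pattern = re.compile(r'album_release_date: "(.*?)"')
soup = BeautifulSoup(html, 'html.parser')
script = soup.script.text
match = pattern.search(script)
if match is None:
    raise ValueError('album_release_date not found in the script tag')
release_date = match.group(1)
print(release_date)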
ergesto