
I want to scrape a block of data from a series of pages that have the data tucked away in a JSON object inside of a script tag. I'm fairly comfortable with BeautifulSoup, but I think I might be barking up the wrong tree trying to use it to get data from JavaScript.

The structure of the pages is, roughly, this:

...
<script>
  $(document).ready(function(){
    var data = $.data(graph_selector, [
         { data: charts.createData("Stuff I want")}
    ])};
</script>

The head and body have a zillion scripts each, but there's only one `var data` per page. I'm not sure how I'd identify this particular `<script>` for BeautifulSoup except by the presence of `var data`.

Can I do this? Or do I need another tool?

alecxe
Amanda
  • I would find the raw javascript string using BeautifulSoup and then use regex to get the 'Stuff I want' [Somewhat related](http://stackoverflow.com/a/21069605/1189040) or [something like this](http://stackoverflow.com/a/21069526/1189040) – Himal Nov 27 '14 at 03:59
  • But then the value of BeautifulSoup in the equation is fairly low. It allows you to find the ` – tripleee Nov 27 '14 at 04:15

1 Answer

BeautifulSoup is an HTML parser; it cannot parse JavaScript code.

Here are the options you have:

  1. Use a JavaScript parser such as `slimit`:

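    from bs4 import BeautifulSoup
    from slimit import ast
    from slimit.parser import Parser
    from slimit.visitors import nodevisitor
    
    data = """
    <script>
        var data = $.data(graph_selector, [
             { data: charts.createData("Stuff I want")}
        ]);
    </script>
    """
    
    # Name the HTML parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')
    
    # Parse the script body into an AST, then find the call whose method name is createData
    parser = Parser()
    tree = parser.parse(script.text)
    print(next(node.args[0].value for node in nodevisitor.visit(tree)
               if isinstance(node, ast.FunctionCall)
               # plain calls such as $(document) have no .identifier.identifier; skip them
               and getattr(node.identifier, 'identifier', None) is not None
               and node.identifier.identifier.value == 'createData'))
    # prints "Stuff I want"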
    

    Note that I had to cut the script down to get a working example, because the original snippet produced parsing errors. This might not work on your real script contents, so please check.

  2. Use regular expressions. This is the easiest option, but it is unreliable, so don't use it in production code unless you also control the JS code and can make the necessary guarantees:

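    import re
    from bs4 import BeautifulSoup
    
    data = """
    <script>
    $(document).ready(function() {
    var data = $.data(graph_selector, [{data: charts.createData("Stuff I want")}]);
    });
    </script>
    """
    
    # Name the HTML parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, 'html.parser')
    script = soup.find('script')
    
    # Escape the dot so it matches literally; the group captures the text between the quotes
    pattern = r'charts\.createData\("(.*?)"\)'
    print(re.search(pattern, script.text).group(1))  # prints: Stuff I want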
    
  3. Let something execute the JavaScript code: selenium (a real browser), V8, or PyExecJS.
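
On the other half of the question, picking out the one `<script>` that declares `var data` when the page has many: here is a minimal sketch, assuming the marker text `var data` really is unique per page. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs; `ScriptCollector`, `extract_chart_data` and the sample `page` are hypothetical names made up for illustration. The extraction step is the same regex as in option 2, so the same reliability caveats apply.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element on the page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so append to the last entry
        if self._in_script:
            self.scripts[-1] += data


def extract_chart_data(html):
    """Return the createData string from the script declaring `var data`, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for source in collector.scripts:
        # Assumption: only the script we want contains "var data"
        if re.search(r"\bvar\s+data\b", source):
            match = re.search(r'charts\.createData\("(.*?)"\)', source)
            if match:
                return match.group(1)
    return None


# Hypothetical page with several scripts, only one of which declares `var data`
page = """
<html><head>
<script>var other = 1;</script>
<script src="jquery.js"></script>
</head><body>
<script>
  $(document).ready(function(){
    var data = $.data(graph_selector, [
         { data: charts.createData("Stuff I want")}
    ]);
  });
</script>
</body></html>
"""

print(extract_chart_data(page))
```

If you would rather stay with BeautifulSoup, passing a compiled pattern as the `string` argument, roughly `soup.find('script', string=re.compile(r'var\s+data'))`, should select the same tag; treat that as a sketch to verify against your installed version.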

alecxe
  • @ivan_pozdeev thanks for the important notes. This is why I've put this option under a second position, also added a working example for the first option. – alecxe Nov 27 '14 at 04:46