How to scrape charts from a website with python?

Question

EDIT:

I have saved the script code below to a text file, but using re to extract the data still returns nothing. My code is:

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL)
scripts = soup.find("script", text=pattern)
profile_text = pattern.search(scripts.text).group(1)
profile = json.loads(profile_text)

print profile["data"], profile["categories"]

I would like to extract the chart's data from a website. The following is the source code of the chart.

  <script type="text/javascript">
    jQuery(function() {

    var chart1 = new Highcharts.Chart({

          chart: {
             renderTo: 'chart1',
              defaultSeriesType: 'column',
            borderWidth: 2
          },
          title: {
             text: 'Productions'
          },
          legend: {
            enabled: false
          },
          xAxis: [{
             categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016],

          }],
          yAxis: {
             min: 0,
             title: {
             text: 'Productions'
          }
          },

          series: [{
               name: 'Productions',
               data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]
               }]
       });
    });

    </script>

There are several charts like this on the website, named "chart1", "chart2", etc. For each chart, I would like to extract the categories line and the data line:

categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]

data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]

I believe you could use selenium for something like that, ex: http://stackoverflow.com/questions/10455130/can-selenium-web-driver-have-access-to-javascript-global-variables — CasualDemon, Oct 05 '16 at 03:15
Yeah I'm using selenium to parse the html content. My code is: `req=urllib2.Request(productions_url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'})` `p=urllib2.urlopen(req)` `soup=BeautifulSoup(p.readlines()[0], 'html.parser')`. My question is, once I parse the html, how do I extract those 2 particular lines? — Ilumtics, Oct 05 '16 at 03:25
An HTML parser won't help you, because that is JavaScript. So you have to parse it yourself. — zvone, Oct 05 '16 at 05:48

Anders Dahl · Answer 1 · 2017-07-29T03:22:01.887

9

Another way is to query the Highcharts JavaScript API, as you would in the browser console, and pull the result out with Selenium.

import time
from selenium import webdriver

website = ""

driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
data = [item[1] for item in temp]
print(data)

Depending on which chart and series you are trying to pull, your case might be slightly different.
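
One way the case can differ: `item[1]` assumes every point is an [x, y] pair. Highcharts also accepts plain numbers (as in the chart from the question) and point objects, which Selenium hands back as Python numbers and dicts. A minimal sketch that copes with all three; `point_values` is a made-up helper name, not part of Selenium or Highcharts:

```python
def point_values(points):
    """Return the y value of each point, whatever form it arrived in."""
    values = []
    for point in points:
        if isinstance(point, dict):
            # point object, e.g. {"x": 1999, "y": 1}
            values.append(point.get("y"))
        elif isinstance(point, (list, tuple)):
            # [x, y] pair
            values.append(point[1])
        else:
            # plain number (or None for a missing point)
            values.append(point)
    return values


print(point_values([1, 1, 0, 6]))
print(point_values([[1999, 1], [2000, 4]]))
```

In the snippet above you would call `point_values(temp)` instead of building the list with `item[1]`.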

edited Jul 29 '17 at 03:22

answered Jul 29 '17 at 03:13


This should be the accepted answer! Much simpler and more intuitive. – ahlexander Oct 15 '17 at 15:03

Tris Forster · Answer 2 · 2016-10-06T05:36:24.980

I'd go with a combination of a regex and a YAML parser. Quick and dirty below; you may need to tweak the regex, but it works with the example:

import re
import sys
import yaml

chart_matcher = re.compile(r'^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$',
        re.MULTILINE | re.DOTALL)

script = sys.stdin.read()

m = chart_matcher.findall(script)

for name, data in m:
    print(name)
    try:
        # the object literal is close enough to YAML flow syntax to parse
        chart = yaml.safe_load(data)
        print("categories: %s" % chart['xAxis'][0]['categories'])
        print("data: %s" % chart['series'][0]['data'])
    except Exception as e:
        print(e)

Requires the yaml library (pip install PyYAML) and you should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.

EDIT - full example

Sorry I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because the declaration is not valid JSON (the keys are unquoted and the strings use single quotes), but plain JavaScript object declarations (i.e. with no functions) are, for the most part, also valid YAML.
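
To see the JSON problem in isolation, here is a small check using only the standard library. The object string is a cut-down version of the chart options from the question:

```python
import json

# Unquoted keys and single-quoted strings are fine in JavaScript, not in JSON
js_object = "{name: 'Productions', data: [1,1,0,1]}"

try:
    json.loads(js_object)
    parsed_as_json = True
except ValueError as err:
    # json's decode errors are ValueError (or a subclass of it)
    parsed_as_json = False
    print("json.loads rejected it: %s" % err)

# A bare array of numbers, on the other hand, is valid JSON
data = json.loads("[1,1,0,1]")
print(data)
```

So the whole declaration needs a more forgiving parser, while a single array pulled out of it can still go through `json.loads`.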

from bs4 import BeautifulSoup
import yaml
import re

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories: %s" % charts[name]['xAxis'][0]['categories'])
    print("data: %s" % charts[name]['series'][0]['data'])
    print("")

Output:

## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [22, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]

Note I had to tweak the regex to make it handle the unicode output and whitespace from BeautifulSoup. In my original example I just piped your source directly to the regex.
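
The whitespace part is easy to reproduce without BeautifulSoup. A small sketch, using a shortened and indented copy of the script from the question: the `^`/`$` anchored pattern finds nothing because `var` does not sit at the start of a line, while the unanchored one matches. (The dot and braces are escaped here for strictness; that is my addition.)

```python
import re

# Indented the way it appears inside the page's <script> tag
script = """
    jQuery(function() {

    var chart1 = new Highcharts.Chart({
          series: [{
               data: [1,1,0]
               }]
       });
    });
"""

flags = re.MULTILINE | re.DOTALL
anchored = re.compile(
    r"^var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);$", flags)
relaxed = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);", flags)

anchored_matches = anchored.findall(script)
relaxed_matches = relaxed.findall(script)

print(len(anchored_matches), len(relaxed_matches))
```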

EDIT 2 - no yaml

Given that the JavaScript looks to be partially generated, the best you can hope for is to grab the individual lines. Not elegant, but it will probably work for you.

from bs4 import BeautifulSoup
import json
import re

file_object = open('citec.repec.org_p_c_pcl20.html', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    script = tag.text

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        # fresh dict per chart, so charts in the same <script> don't share values
        values = {}
        for line in obj_declaration.split('\n'):
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))

        charts[name] = values

print(charts)

Note that it fails for chart7 because that references another variable.
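
If you need that case too, one possible workaround is to collect arrays declared as `var name = [...];` and substitute them before parsing. This is only a sketch: the real chart7 markup is not shown in this thread, so the script below and the names `prod_data`, `array_vars` and `parse_array` are made up for illustration. It also drops a trailing comma before `]`, which `json.loads` would otherwise reject.

```python
import json
import re

# Hypothetical script; the actual chart7 source may look different
script = """
var prod_data = [3,5,8,];
var chart7 = new Highcharts.Chart({
    xAxis: [{
        categories: [2014,2015,2016],
    }],
    series: [{
        data: prod_data
    }]
});
"""

# Map each "var name = [...];" declaration to its array text
array_vars = dict(re.findall(r"var\s+(\w+)\s*=\s*(\[[^\]]*\])\s*;", script))


def parse_array(text):
    """Parse a flat array literal, or the name of a variable holding one."""
    text = array_vars.get(text, text)     # resolve a variable reference
    text = re.sub(r",\s*\]", "]", text)   # drop a trailing comma before ]
    return json.loads(text)


values = {}
for line in script.split("\n"):
    line = line.strip("\t\n ,;")
    for field in ("data", "categories"):
        if line.startswith(field + ":"):
            values[field] = parse_array(line[len(field) + 1:].strip())

print(values)
```

This only handles flat arrays of numbers; nested arrays or computed values would still need a real JavaScript engine, such as the Selenium route.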

I have saved the script code below to a text file, but using re to extract the data still returns nothing. My code is: `file_object = open('source_test_script.txt', mode="r")` `soup = BeautifulSoup(file_object, "html.parser")` `pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL)` `scripts = soup.find("script", text=pattern)` `profile_text = pattern.search(scripts.text).group(1)` `profile = json.loads(profile_text)` `print profile["data"], profile["categories"]` — Ilumtics, Oct 05 '16 at 23:07
I tried the code as you suggested but kept getting this: "Failed to parse chart1: while parsing a flow mapping in "", line 29, column 16: tooltip: { ^ expected ',' or '}', but got '{'" — Ilumtics, Oct 06 '16 at 03:33
You may still want to use `yaml.safe_load` instead of `json.loads` as it is more forgiving on bad input (chart3 for example has trailing commas in the arrays) — Tris Forster, Oct 06 '16 at 05:46
The json.loads code works now but the yaml code still gives me the same error ... — Ilumtics, Oct 06 '16 at 05:46
