I am trying to crawl a web page that contains a chart with my first spider.

When I use Chrome to copy the XPath of this chart, it gives me something like this:

//*[@id="highcharts-jji61bd-2"]/svg/g[4]/g[1]/rect[1]

After some searching, I also tried this:

//*[@id="highcharts-jji61bd-2"]/[name()='svg']

Nothing is returned. An example is the "Age in Ultimo" chart on this page:

http://suburbdata.com.au/Sydney/Ultimo

However, when I inspect the full response, the chart is not there. I can only find a container div element:

<div id="chart_age_distribution" class="details_chart" style="width: 250px; height: 200px; margin: 0 auto">

I think this chart is created on the client side, but I don't know how to simulate that step so the chart gets generated.

Any ideas would help. Thanks.

Tomáš Linhart
dawenzi098
  • You cannot get that chart with Scrapy because it is not present in the initial page source; it is generated dynamically. [This might be helpful](https://stackoverflow.com/questions/30345623/scraping-dynamic-content-using-python-scrapy) – Andersson Jan 18 '18 at 08:59

1 Answer

If you look at the page source, you'll see that the charts are generated on the frontend using the Highcharts JavaScript library.

The charts' data is in the page source, however, and with a little effort it can be extracted. Look at this solution employing the js2xml library (just the relevant part, which should go into the parse method):

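import re
import js2xml

def parse(self, response):
    # ...
    # Use response.text (decoded str); response.body is bytes on Python 3
    # and cannot be searched with a str pattern.
    match = re.search(r'<script>([^<]*?#chart_[^<]*?\.highcharts[^<]*?)</script>', response.text, flags=re.DOTALL)
    if match is None:
        return  # no inline Highcharts script found in this response
    parsed = js2xml.parse(match.group(1))

    # Each function call in the parsed script is treated as one chart definition.
    for chart in parsed.xpath('//body//functioncall'):
        categories = chart.xpath('.//property[@name="categories"]/array/string/text()')
        values = chart.xpath('.//property[@name="data"]/array/number/@value')
        # Materialize the pairs; zip() is lazy on Python 3.
        data = list(zip(categories, values))
    # ...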
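
For comparison, here is a minimal standard-library sketch of the same idea, runnable without Scrapy or js2xml. It is not the js2xml approach above: it swaps in plain regular expressions plus json. The SAMPLE_JS string and the extract_chart helper are hypothetical and only shaped like a typical Highcharts call; the real page's script may look different, and this shortcut will break on nested arrays or on strings that contain quotes.

```python
import json
import re

# Hypothetical inline script, shaped like a typical Highcharts call.
# The script on the real page may be structured differently.
SAMPLE_JS = """
$(function () {
    $('#chart_age_distribution').highcharts({
        xAxis: { categories: ['0-14', '15-24', '25-64'] },
        series: [{ name: 'Age', data: [5.1, 40.2, 54.7] }]
    });
});
"""


def extract_chart(js_source):
    """Return (category, value) pairs from one Highcharts-style call."""
    cats = re.search(r"categories\s*:\s*(\[.*?\])", js_source, flags=re.DOTALL)
    vals = re.search(r"\bdata\s*:\s*(\[.*?\])", js_source, flags=re.DOTALL)
    if cats is None or vals is None:
        return []
    # Single-quoted JS strings are not valid JSON, so swap the quotes first.
    categories = json.loads(cats.group(1).replace("'", '"'))
    values = json.loads(vals.group(1))
    return list(zip(categories, values))


pairs = extract_chart(SAMPLE_JS)
```

A real JavaScript parser such as js2xml is the more robust choice; the regex version is only meant to show what "the data is already in the page source" means in practice.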
Tomáš Linhart
  • Thanks, your solution might be the better one. What I found is that the ID of the SVG element's parent appears to be generated by JavaScript (or something else) and is different every time, which is why my Scrapy spider could not find anything. I can easily work around this by locating an ancestor element whose ID does not change and then following an index path to the data I want (this might only work in my case). – dawenzi098 Jan 19 '18 at 02:27
  • I also changed scrapy.Request to SplashRequest. – dawenzi098 Jan 19 '18 at 02:42
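
The workaround described in the comments (anchor on an ancestor whose ID is stable, then follow a relative path) can be sketched with the standard library alone. The RENDERED markup below is a made-up stand-in for what a rendering service such as Splash might return; the inner structure and the height attributes are assumptions, not the real page's output.

```python
import xml.etree.ElementTree as ET

# Hypothetical rendered markup. The inner "highcharts-..." id changes on
# every page load, while the outer container id stays the same.
RENDERED = """
<body>
  <div id="chart_age_distribution" class="details_chart">
    <div id="highcharts-jji61bd-2">
      <svg>
        <g><rect height="10"/><rect height="20"/></g>
      </svg>
    </div>
  </div>
</body>
"""

root = ET.fromstring(RENDERED)

# Anchor on the stable container id instead of the generated one.
container = root.find(".//div[@id='chart_age_distribution']")

# Follow a relative, index-based path from the stable ancestor.
rects = container.findall("./div/svg/g[1]/rect")
heights = [rect.get("height") for rect in rects]
```

The same idea carries over to Scrapy selectors on the rendered response: start the XPath at the container div rather than at the generated highcharts-... ID.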