
I am scraping financial data from the link below using Scrapy:

Financial Data from Tencent

The response.body looks like this:

Response.body

I have tried to split the response using a regular expression and then convert it to JSON, but it says no JSON object could be decoded. Here is my code:

import scrapy
import re
import json

class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['web.ifzq.gtimg.cn']
    start_urls = ['http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery11240339550$']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse,
            #endpoint='render.json', # optional; default is render.html
            #splash_url='<url>',     # optional; overrides SPLASH_URL
            #slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
            )

    def parse(self, response):
        try:
            json_data = re.search('\{\"data\"\:(.+?)\}\}\]', response.text).group(1)
        except AttributeError:
            json_data = ''
        #print json_data
        loaded_json = json.loads(json_data)

        print loaded_json


It throws an error saying that no json object can be decoded:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy_splash/middleware.py", line 156, in process_spider_output
    for el in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/finance/finance/spiders/stocks.py", line 25, in parse
    loaded_json = json.loads(json_data)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2018-06-09 23:54:26 [scrapy.core.engine] INFO: Closing spider (finished)

My goal is to convert the response to JSON so that I can easily iterate over the content. Is it necessary to convert it to JSON, and how do I convert it in this case? The response is in Unicode format, so do I need to convert it to UTF-8 as well? Is there any other good way to do the iteration?

Nicholas Kan
  • Is that last argument in the url (`&_callback=jQuery1124033955090772971586_1528569153921`) required? The data returned without it looks similar and is a valid json. – bla Jun 10 '18 at 00:32
  • I don't know, I am a newbie to Python and Scrapy. The link of the source data is found from this website: http://gu.qq.com/hk00001/gp/income I used Google Chrome->Inspect->Source to find that link – Nicholas Kan Jun 10 '18 at 00:41

3 Answers


As bla said, without `&_callback=jQuery1124033955090772971586_1528569153921` the data is valid JSON. The callback is not required, and it is not static either; for example, http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=test gives the same results.
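The callback parameter can also be dropped programmatically before making the request. A minimal sketch using the standard library (the `drop_callback` helper is illustrative, not part of Scrapy):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def drop_callback(url, param="_callback"):
    """Return the URL with the JSONP callback parameter removed."""
    parts = urlsplit(url)
    # Keep every query pair except the callback parameter.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(query)))

url = ('http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport'
       '?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016'
       '&_callback=test')
print(drop_callback(url))
```

With the callback gone, the server responds with plain JSON that `json.loads` can parse directly.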

ergesto

The problem seems to be that the actual data is wrapped inside `jQuery1124033955090772971586_1528569153921()`. I was able to get rid of it by removing a parameter from the request URL. If you absolutely need it, this may do the trick:

>>> import json
>>> url = 'http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953'
>>> fetch(url)
2018-06-09 21:55:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953> (referer: None)
>>> data = response.text.strip('jQuery1124033955090772971586_1528569153921()')
>>> parsed_data = json.loads(data)
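One caveat: `str.strip('jQuery…()')` removes any of those *characters* from both ends, not the literal prefix and suffix; it happens to work here only because the JSON payload starts with `{` and ends with `}`, neither of which is in the strip set. A more defensive sketch, assuming the standard JSONP shape `callback({...})`:

```python
import json
import re

def unwrap_jsonp(text):
    """Extract and parse the JSON payload from a JSONP response
    of the form callback({...})."""
    match = re.search(r'^[\w$.]+\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if not match:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))

sample = 'jQuery1124033955090772971586_1528569153921({"code": 0, "data": {"data": []}})'
print(unwrap_jsonp(sample))  # {'code': 0, 'data': {'data': []}}
```

This way the callback name can change between requests without breaking the extraction.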

If you prefer to remove the _callback parameter from the url, simply:

>>> import json
>>> url = 'http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_=1528569153953'
>>> fetch(url)
2018-06-09 21:53:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_=1528569153953> (referer: None)
>>> parsed_data = json.loads(response.text)
bla
  • Hi bla, both methods are OK for me, thanks for your answer, now come to second part of my question, the scrapy response is in unicode format, hence the print result: {u'msg': u'', u'code': 0, u'data': {u'data': [{u'fd_administration_fee_ratio': u'--', u'fd_repdate_ratio': u'0.05', u'fd_profit_after_share_ratio': u'--', u'fd_stock_dividend': u'10340.00', u'fd_administration_fee': u'0.00', u'fd_depreciation': u'-13262.00', u'fd_dividend_base_share': u'2.68', u'fd_profit_before_tax_ratio': u'-1.69...............Do you know how to convert it to a native json by removing the 'u'? – Nicholas Kan Jun 10 '18 at 01:13
  • You are welcome, :). You should not have problems dealing with strings beginning with `u` (check out [this response](https://stackoverflow.com/questions/11279331/what-does-the-u-symbol-mean-in-front-of-string-values#11279428) for some more on that). If you really want to get rid of it try parsing `response.text.encode('utf8')` (docs [here](https://docs.python.org/2.7/library/stdtypes.html?highlight=str%20encode#str.encode)). If possible try using python3 as well. :) – bla Jun 10 '18 at 01:24
  • Hi bla, I found that your answer does not work when compared to the original data. In the original data the record should start from 'fd_year'; however, after changing to your code, it starts from 'administration fee'. Would you please take a look into this? – Nicholas Kan Jun 10 '18 at 01:31
  • `json.loads` keeps returning unicode encoded strings even when the source is not one. The simplest solution then would be creating your own dict given the one returned by the parser: `{k.encode('utf8'):v.encode('utf8') for k, v in json.loads(data).items()}` – bla Jun 10 '18 at 01:31
  • I am sorry to hear that. Can provide some more information about what went wrong? – bla Jun 10 '18 at 01:33
  • [`dicts`](https://docs.python.org/2.7/library/stdtypes.html#dict) in python do not guarantee order. So you can't rely on the order when you are iterating over it. You may try sorting a list with the keys in any given order and them iterating over it. Alternatively you may try using [`collections.OrderedDict`](https://docs.python.org/2.7/library/collections.html?highlight=collections%20ordereddict#collections.OrderedDict), which by default keep items ordered by insertion order. – bla Jun 10 '18 at 01:43
  • I checked it by eyeball, yes it seems OK, just in different order. It is very strange, I exported the response into json format, then I opened the json file, it does not contain the u' {"msg": "", "code": 0, "data": {"data": [{"fd_administration_fee_ratio": "--", "fd_repdate_ratio": "0.05", "fd_profit_after_share_ratio": "--", "fd_stock_dividend": " – Nicholas Kan Jun 10 '18 at 02:38
  • There is missing data after exporting? This is very strange indeed. How did you export it? Maybe there is something going on there. – bla Jun 10 '18 at 02:45
  • The data is complete, but just don't know why there is a u' in every elements. – Nicholas Kan Jun 10 '18 at 11:43
  • The `u` prefix indicates that those are [unicode strings](https://docs.python.org/2/tutorial/introduction.html#unicode-strings). – bla Jun 10 '18 at 17:00
  • I have read several posts on the internet and it seems that the `u` does not affect usage; however, I still don't understand why I cannot iterate the list and then post to the MySQL server. Could you help me check my code? Can I send it to you by e-mail? – Nicholas Kan Jun 11 '18 at 15:47
  • I have no experience with mysql, but feel free to send. I will be able to see it as soon as I get home. – bla Jun 11 '18 at 16:02
  • Just sent to you by e-mail now – Nicholas Kan Jun 12 '18 at 13:32
  • I am at work right now, but I will take a look as soon as I get home. – bla Jun 12 '18 at 13:33
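On the ordering point raised in the comments: Python 2.7 dicts do not preserve insertion order, which is why the fields come back shuffled relative to the document. `json.loads` accepts an `object_pairs_hook` that can restore document order (sketched here in Python 3 syntax):

```python
import json
from collections import OrderedDict

# A shortened sample record in the same shape as the API response.
payload = ('{"fd_year": "2016", "fd_administration_fee_ratio": "--", '
           '"fd_repdate_ratio": "0.05"}')

# object_pairs_hook receives the key/value pairs in document order,
# so an OrderedDict keeps the fields exactly as they appear in the JSON.
ordered = json.loads(payload, object_pairs_hook=OrderedDict)
print(list(ordered.keys()))
# ['fd_year', 'fd_administration_fee_ratio', 'fd_repdate_ratio']
```

On Python 3.7+ plain dicts already preserve insertion order, so this hook is only needed on older interpreters.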
import re
import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gtimg.cn']
    start_urls = ['http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953']

    def parse(self, response):
        try:
            # Evaluate the dict literal inside the jQuery<digits>_<digits>(...) wrapper.
            data = eval(re.findall(r'jQuery\d+_\d+(\(\{.+\}\))', response.body)[0])
            print data
        except (IndexError, SyntaxError):
            self.log('Response could not be parsed; it seems to have a different format')

Instead of converting to JSON, use `eval`, because in the end you are going to use it as a dict of lists anyway.

It may look like this:

import re
import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    allowed_domains = ['gtimg.cn']
    start_urls = ['http://web.ifzq.gtimg.cn/appstock/hk/HkInfo/getFinReport?type=3&reporttime_type=1&code=00001&startyear=1990&endyear=2016&_callback=jQuery1124033955090772971586_1528569153921&_=1528569153953']

    def parse(self, response):
        data = eval(re.findall(r'jQuery\d+_\d+(\(\{.+\}\))', response.body)[0])
        items = data.get('data', {}).get('data', [])

        for item in items:
            yield item

Or you can use `json.loads` instead of `eval`; that is also fine.
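For completeness, the `json.loads` variant of the extraction step can be sketched like this (the wrapper name in the sample string is shortened for illustration; `parse_jsonp_items` is a hypothetical helper, not part of Scrapy):

```python
import json
import re

def parse_jsonp_items(body):
    """Pull the JSON out of the jQuery<digits>_<digits>(...) wrapper
    and return the inner list of records."""
    match = re.search(r'jQuery\d+_\d+\((\{.+\})\)', body, re.DOTALL)
    if not match:
        return []
    data = json.loads(match.group(1))
    # The records sit under data['data']['data'] in this API response.
    return data.get('data', {}).get('data', [])

body = 'jQuery1124_1528({"code": 0, "data": {"data": [{"fd_year": "2016"}]}})'
for item in parse_jsonp_items(body):
    print(item)  # {'fd_year': '2016'}
```

Unlike `eval`, this never executes the response as code, which is safer when the payload comes from a remote server.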

Yash Pokar
  • Thanks a lot for your help Yash. The first method generates a pretty neat 'json-like' response. The second method generates the extracted values, and they are stored neatly when exported as CSV; however, a few rows are stored in JSON format. I spent the afternoon testing iteration and pushing the data to the MySQL server but had no luck; not sure if it is a problem with my code or with the JSON extracted from Scrapy. – Nicholas Kan Jun 10 '18 at 11:52
  • @NicholasKan anything else where i can help you? – Yash Pokar Jun 10 '18 at 12:01
  • Could you help test my code to see why I cannot iterate through the generated JSON / why it is not posted to the MySQL server? – Nicholas Kan Jun 10 '18 at 12:04
  • Can I send you by e-mail? – Nicholas Kan Jun 10 '18 at 12:11
  • Sent just now, please check your e-mail – Nicholas Kan Jun 10 '18 at 12:31
  • Just sent the IP address to you, pls check again – Nicholas Kan Jun 10 '18 at 12:35