
I am trying to parse a JSON item of the following format:

DataStore.prime('ws-stage-stat', 
{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 }, 
[[['goal','fastbreak','leftfoot',[1]],['goal','openplay','leftfoot',[2]], 
['goal','openplay','rightfoot',[1]],['goal','owngoal','leftfoot',[1]],
['goal','penalty','rightfoot',[1]],['miss','corner','header',[6]],
['miss','corner','leftfoot',[2]],['miss','corner','rightfoot',[2]],
['miss','crossedfreekick','header',[1]],['miss','openplay','header',[4]],
['miss','openplay','leftfoot',[11]],['miss','openplay','rightfoot',[27]]]]

The items in quotes represent descriptions of the types of goals scored or chances missed that are listed on a website. The numbers represent the volume. I'm assuming that this is a JSON array of arrays with mixed text and numerical data. What I would like to do is break this down into Python variables in the format of

var1 = "'goal','fastbreak','leftfoot'"
var2 = 1

...and repeat for all elements of the above pattern.

The code that is parsing this data structure is this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        url = 'http://www.whoscored.com/stagestatfeed'
        params = {
            'against': '0',
            'field': '2',
            'stageId': '9155',
            'teamId': '32',
            'type': '8'
            }
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}

        responser = requests.get(url, params=params, headers=headers)

        print responser.text

I've checked the type of responser.text with print type(responser.text), which reports 'unicode'. Does this mean that this object is now a set of nested Python lists? If so, how can I parse it to get the data in the format I am after?

Thanks

gdogg371
  • Still scraping? Just use `literal_eval` like you did in your previous code. It works fine. – Padraic Cunningham Sep 15 '14 at 22:56
  • @PadraicCunningham Yes, still working at it. I've built most of it now; I'm just handling the XHR requests to finish it off. One of the pages in particular is a bit of a nightmare, as there is virtually no embedded code and it is all XHR requests with tricky-to-parse responses sent back. – gdogg371 Sep 15 '14 at 23:07
  • @PadraicCunningham: `ast.literal_eval` isn't going to work on something of the format `Datastore.prime(…)`, it's going to raise an exception on the `Call` node. – abarnert Sep 15 '14 at 23:30
  • `DataStore.prime` is not returned when I run your code – Padraic Cunningham Sep 16 '14 at 18:09
  • @PadraicCunningham `DataStore.prime` is embedded in the source code of the page and contains the same values as the XHR request above, in the same structure. I used it as an example of the data structure more than anything. When you change the field value through 0, 1, 2 it repopulates the list of lists with different data. – gdogg371 Sep 16 '14 at 23:37

3 Answers


That's not JSON. JSON doesn't allow single-quoted strings. It also doesn't have constructor calls like that. See the official grammar.
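
A quick sanity check with the stdlib json module shows it refusing the single-quoted strings outright:

import json

try:
    json.loads("['goal','fastbreak','leftfoot']")  # single quotes are invalid JSON
except ValueError as exc:
    print 'not JSON:', exc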

You really want to figure out what format you actually have, and parse it appropriately. Or, better, if you have any control over the output code, fix it to be something that's easy (and safe and efficient) to parse.

At any rate, this looks like a repr of a Python object (in particular, a DataStore.prime call taking a string, a dict, and a list of lists of … as arguments). So, you probably could parse it with eval. Whether that's a good idea or not (possibly with some kind of sanitizing) depends on where you're getting the data from and what your security requirements are.
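
If you do go the eval route, here is a minimal sketch of the idea, assuming the feed text is the complete call (closing parenthesis included) and that the five dict keys are the only bare identifiers: hand eval a stub DataStore that records its arguments, plus bindings that turn the bare keys into strings.

class _Capture(object):
    # stub standing in for the real DataStore; just records the call
    def prime(self, name, params, groups):
        self.args = (name, params, groups)

payload = ("DataStore.prime('ws-stage-stat', "
           "{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 }, "
           "[[['goal','fastbreak','leftfoot',[1]]]])")

capture = _Capture()
namespace = {'DataStore': capture,
             # bind each bare key to its own name so that
             # { against: 0 } evaluates to {'against': 0}
             'against': 'against', 'field': 'field', 'stageId': 'stageId',
             'teamId': 'teamId', 'type': 'type'}
eval(payload, namespace)

name, params, groups = capture.args
print groups[0]  # [['goal', 'fastbreak', 'leftfoot', [1]]]

The usual eval caveats still apply: anything else embedded in the response would be executed too.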

Or it could just as easily be JavaScript code. Or various other scripting languages. (Most of them have similar structures with similar syntax—which is exactly why they all map between JSON and native data so easily; JSON is basically a subset of the literals for most scripting languages.)

A slightly safer and saner solution would be to explicitly parse out the top level, then use ast.literal_eval to parse out the string, dict, and list components.
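
A minimal sketch of that idea for the part the question actually needs (the list of lists), assuming it is the last argument; the dict has unquoted keys, so it is not a valid Python literal and would need quoting before literal_eval could handle it:

from ast import literal_eval

def extract_pairs(payload):
    # slice out the outermost [[...]] literal; the single-quoted strings
    # and integers inside it are valid Python literals
    start = payload.index('[[[')
    end = payload.rindex(']]') + 2
    groups = literal_eval(payload[start:end])

    pairs = []
    for entry in groups[0]:
        labels = "'%s'" % "','".join(entry[:-1])  # "'goal','fastbreak','leftfoot'"
        pairs.append((labels, entry[-1][0]))
    return pairs

Called on responser.text from the question, this would give pairs like ("'goal','fastbreak','leftfoot'", 1), with the counts already ints.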

A possibly overly complicated solution would be to write a real custom parser.

But the best solution, again, would be to change the source to give you something more useful. Even if you really want to pass a Python object unsafely, pickle is a better idea than repr and eval. But most likely, that isn't what you actually want to do in the first place.
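
For completeness, the pickle version of "pass a Python object" is one call on each side; a sketch (with the usual caveat that unpickling untrusted data is every bit as dangerous as eval):

import pickle

# sender side: serialize the native data structure
blob = pickle.dumps(('ws-stage-stat',
                     {'against': 0, 'field': 2, 'stageId': 9155,
                      'teamId': 26, 'type': 8},
                     [[['goal', 'fastbreak', 'leftfoot', [1]]]]))

# receiver side: native Python objects come back, no parsing needed
name, params, groups = pickle.loads(blob)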

abarnert
  • Thanks for replying. Is DataStore.prime a standard Python object then? I tried reading up on this a while ago and was unable to locate anything that explained what it was. – gdogg371 Sep 15 '14 at 23:09
  • @user3045351: No, there's nothing in the stdlib named `Datastore`. It's clearly part of some library being used by the server. I believe both Google App Engine and Dropbox use that name, but I'm sure there are other libraries that do as well. If you don't know where your data are coming from and what they represent, there's not much you can do with them… – abarnert Sep 15 '14 at 23:29

One option would be to utilize a regular expression here:

import re

data = """
DataStore.prime('ws-stage-stat',
{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 },
[[['goal','fastbreak','leftfoot',[1]],['goal','openplay','leftfoot',[2]],
['goal','openplay','rightfoot',[1]],['goal','owngoal','leftfoot',[1]],
['goal','penalty','rightfoot',[1]],['miss','corner','header',[6]],
['miss','corner','leftfoot',[2]],['miss','corner','rightfoot',[2]],
['miss','crossedfreekick','header',[1]],['miss','openplay','header',[4]],
['miss','openplay','leftfoot',[11]],['miss','openplay','rightfoot',[27]]]]
"""

# parse js
pattern = re.compile(r"\[([^\[]+?),\[(\d+)\]\]")

print pattern.findall(data)

Prints:

[
    ("'goal','fastbreak','leftfoot'", '1'), 
    ("'goal','openplay','leftfoot'", '2'),
    ...
    ("'miss','openplay','rightfoot'", '27')
]

\[([^\[]+?),\[(\d+)\]\] basically matches the groups in square brackets. Parentheses here help to capture certain parts of the matched string; backslashes escape characters that have a special meaning in regex, like [ and ].
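
If you want the exact var1 = ..., var2 = ... layout from the question, a small follow-up loop over the findall result will do it; note that findall returns the counts as strings, so they need an int() conversion:

for i, (labels, count) in enumerate(pattern.findall(data)):
    print 'var%d = %r' % (2 * i + 1, labels)
    print 'var%d = %d' % (2 * i + 2, int(count))

For the first entry this prints var1 = "'goal','fastbreak','leftfoot'" and var2 = 1.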


Another option, since this looks suspiciously like a part of JavaScript code, would be to use a JavaScript parser. I've successfully used the slimit module for this before.
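
A rough sketch with slimit follows; the grouping heuristic here assumes every innermost entry is three string labels followed by a one-element count array, and the closing ); has been added to make the snippet a complete statement:

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

js = """DataStore.prime('ws-stage-stat',
{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 },
[[['goal','fastbreak','leftfoot',[1]],['miss','openplay','rightfoot',[27]]]]);"""

tree = Parser().parse(js)
for node in nodevisitor.visit(tree):
    # match arrays of the shape [str, str, str, [number]]
    if (isinstance(node, ast.Array) and len(node.items) == 4
            and isinstance(node.items[3], ast.Array)):
        labels = [item.value.strip("'\"") for item in node.items[:3]]  # values keep their quotes
        count = int(node.items[3].items[0].value)
        print labels, count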

alecxe
  • @user3045351 Well, I rechecked the content-type of the `response` - it says `text/html; charset=utf-8`, but this is clearly part of JavaScript code, since there is `DataStore`-based logic inside the `script`s on the web page. Does the solution work for you? – alecxe Sep 16 '14 at 15:44
  • Yes, this is almost there, except the answer gives a response that appears to still be in Unicode, like so: (u"'goal','corner','rightfoot'", u'1'). I've tried decoding and encoding, but it still stays in that format. Any idea how to get rid of the 'u'? Thanks – gdogg371 Sep 16 '14 at 18:02
  • @user3045351 you don't need to get rid of the `u` (see http://stackoverflow.com/questions/2464959/whats-the-u-prefix-in-a-python-string). – alecxe Sep 16 '14 at 18:05

Running your code and taking responser.text, you can split the text to get the list of data, then use an OrderedDict to hold the required variables.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import requests
from ast import literal_eval
from collections import OrderedDict


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        url = 'http://www.whoscored.com/stagestatfeed'
        params = {
            'against': '0',
            'field': '2',
            'stageId': '9155',
            'teamId': '32',
            'type': '8'
            }
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
           'X-Requested-With': 'XMLHttpRequest',
           'Host': 'www.whoscored.com',
           'Referer': 'http://www.whoscored.com/'}

        responser = requests.get(url, params=params, headers=headers)

        resp = responser.text

        # find the whitespace-separated chunk that holds the list of
        # lists; this assumes the payload list itself contains no spaces
        for line in resp.split():
            if line.startswith("[[["):
                break
        l = literal_eval(line)

        # store the joined labels and the count as alternating
        # var1, var2, ... entries in an ordered dict
        d = OrderedDict()
        count = 1
        for sub_ele in l[0]:
            d["var{}".format(count)] = ", ".join(sub_ele[:-1])
            count += 1
            if sub_ele[-1][0]:
                d["var{}".format(count)] = sub_ele[-1][0]
                count += 1
        print d

OrderedDict([('var1', 'goal, corner, rightfoot'), ('var2', 1), ('var3', 'goal, directfreekick, leftfoot'), ('var4', 1), ('var5', 'goal, openplay, leftfoot'), ('var6', 2), ('var7', 'goal, openplay, rightfoot'), ('var8', 2), ('var9', 'miss, corner, header'), ('var10', 5), ('var11', 'miss, corner, rightfoot'), ('var12', 1), ('var13', 'miss, directfreekick, leftfoot'), ('var14', 1), ('var15', 'miss, directfreekick, rightfoot'), ('var16', 2), ('var17', 'miss, openplay, header'), ('var18', 4), ('var19', 'miss, openplay, leftfoot'), ('var20', 14), ('var21', 'miss, openplay, rightfoot'), ('var22', 16)])
Padraic Cunningham