I am trying to parse a JSON item of the following format:
DataStore.prime('ws-stage-stat',
{ against: 0, field: 2, stageId: 9155, teamId: 26, type: 8 },
[[['goal','fastbreak','leftfoot',[1]],['goal','openplay','leftfoot',[2]],
['goal','openplay','rightfoot',[1]],['goal','owngoal','leftfoot',[1]],
['goal','penalty','rightfoot',[1]],['miss','corner','header',[6]],
['miss','corner','leftfoot',[2]],['miss','corner','rightfoot',[2]],
['miss','crossedfreekick','header',[1]],['miss','openplay','header',[4]],
['miss','openplay','leftfoot',[11]],['miss','openplay','rightfoot',[27]]]]
The items in quotes represent a description of types of goals scored or chances missed that are listed on a website. The numbers represents the volume. I'm assuming that this is a JSON array of arrays with mixed text and numerical data. What I would like to do is break this down into python variables in the format of
var1 = "'goal','fastbreak','leftfoot'"
var2 = 1
...and repeat for all elements of the above pattern.
The code that is parsing this data structure is this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests
class ExampleSpider(CrawlSpider):
name = "goal2"
allowed_domains = ["whoscored.com"]
start_urls = ["http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United"]
download_delay = 5
rules = [Rule(SgmlLinkExtractor(allow=('http://www.whoscored.com/Teams/32/Statistics/England-Manchester-United'),deny=('/News', '/Graphics', '/Articles', '/Live', '/Matches', '/Explanations', '/Glossary', 'ContactUs', 'TermsOfUse', 'Jobs', 'AboutUs', 'RSS'),), follow=False, callback='parse_item')]
def parse_item(self, response):
url = 'http://www.whoscored.com/stagestatfeed'
params = {
'against': '0',
'field': '2',
'stageId': '9155',
'teamId': '32',
'type': '8'
}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'Host': 'www.whoscored.com',
'Referer': 'http://www.whoscored.com/'}
responser = requests.get(url, params=params, headers=headers)
print responser.text
I've checked the type of responser.text using print type(responser.text)
, which returns a result of 'unicode'. Does this mean that this object is now a set of nested Python lists? If so, how can I parse it so that it returns the data in the format that I am after?
Thanks