I am using Python 2.7 (64-bit, from Python.org) on Windows Vista 64-bit. I am trying to use Scrapy with a regex to parse the full contents of an item on the page that begins with DataStore.prime('standings'. If I use the code:
import re

regex = re.compile(r"DataStore\.prime\('standings', { stageId: \d+ }.*", re.S)
match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
.replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
.replace('title=', '')
print match3
This returns everything on the page from the point where the regex matches onward, which is not what I want. I only want the data stored within the item. I have also tried:
regex = re.compile(r'\[\[.*?\].*')
match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
.replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
.replace('title=', '')
print match3
This returns the first subsection of the 'DataStore.prime' item I am interested in, up to the first closing ']'. However, this regex does not target the specific item I am interested in on the page. I think what I need is a hybrid of the two. I have tried a final regex of:
regex = re.compile(r"DataStore\.prime\('standings', { stageId: \d+ } \[\[.*?\]\]\);.*", re.S)
But this now returns a different part of the page entirely. I'm almost there with it, but I can't quite get the last bit right.
Can anyone assist?
Thanks
EDIT:
Here is a sample of the script I am trying to scrape:
DataStore.prime('standings', { stageId: 7794 }, [[Some sample stats here],[[Some sample stats here],[[Some sample stats here]]);
Please note that the stageId value (7794 in the example above) is dynamic and changes from page to page wherever this data structure appears, so its literal value cannot be hard-coded into a regex or any other parsing method.
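
For reference, here is a minimal sketch of the kind of hybrid I am after, assuming the layout in the sample above holds (a comma after the closing brace, and the data ending at the first ']]);'). The names sample_body, STANDINGS_RE and extract_standings are hypothetical, and sample_body is a made-up stand-in for response.body, not the real page:

```python
import re

# Hypothetical stand-in for response.body; the real page contains far more.
sample_body = (
    "var x = 1;\n"
    "DataStore.prime('standings', { stageId: 7794 }, "
    "[[1,'Team A'],\n[2,'Team B']]);\n"
    "DataStore.prime('other', { id: 1 }, [[9]]);\n"
)

# \d+ accepts whatever stageId the page carries. The lazy (.*?) stops at the
# first "]]);" after the opening "[[", and there is no trailing ".*", so the
# match cannot run on to the end of the page. re.S lets "." cross newlines.
STANDINGS_RE = re.compile(
    r"DataStore\.prime\('standings',\s*\{\s*stageId:\s*\d+\s*\},\s*"
    r"(\[\[.*?\]\])\);",
    re.S,
)


def extract_standings(body):
    """Return the [[...]] argument as text, or None if the call is absent."""
    match = STANDINGS_RE.search(body)
    return match.group(1) if match else None


standings = extract_standings(sample_body)
```

One limitation of this sketch: if the data itself ever contains the text ']]);' inside a string, the lazy match would stop there and cut the result short.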