1

I am using Python.org version 2.7 (64-bit) on Windows Vista (64-bit). I am trying to use Scrapy with a regex to extract the full contents of one item on a page: a JavaScript call that begins with DataStore.prime('standings'. If I use the code:

import re

regex = re.compile(r"DataStore\.prime\('standings', { stageId: \d+ }.*", re.S)
match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
               .replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
               .replace('title=', '')
print match3

I get everything on the page from the point where the regex first matches, which is not what I want. I only want the data stored within the item. I have also tried:

regex = re.compile(r'\[\[.*?\].*')

match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
               .replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
               .replace('title=', '')
print match3

This returns the first subsection of the 'DataStore.prime' item I am interested in, up to the first closing ']'. However, this pattern is not anchored to the item I am interested in on the page. I think what I need is a hybrid of the two. I have tried a final regex of:

regex = re.compile(r"DataStore\.prime\('standings', { stageId: \d+ } \[\[.*?\]\]\);.*", re.S)

But this now returns a different part of the page entirely. I'm almost there with it, but I can't quite get the last bit right.

Can anyone assist?

Thanks

EDIT:

Here is some sample script from what I am trying to scrape:

DataStore.prime('standings', { stageId: 7794 }, [[Some sample stats here],[[Some sample stats here],[[Some sample stats here]]);

Please note that in the above example 'stageId: 7794' is a dynamic value that changes from page to page wherever this data structure is encountered, so it cannot be hard-coded in a regex or any other parsing method.
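As an illustration of the kind of pattern being asked about, here is a minimal sketch that anchors on the DataStore.prime('standings' prefix, allows any numeric stageId via \d+, and captures lazily up to the closing ");". The names page, pattern and standings are hypothetical, and the text around the sample call is made up for the demonstration. It is written to run on Python 2.7 and 3.

```python
import re

# Hypothetical page text: the 'standings' call is the sample from the question,
# the lines around it are invented so there is something to exclude.
page = (
    "var x = 1;\n"
    "DataStore.prime('standings', { stageId: 7794 }, "
    "[[Some sample stats here],[[Some sample stats here],[[Some sample stats here]]);\n"
    "DataStore.prime('other', { stageId: 7794 }, [[unrelated]]);\n"
)

# \d+ covers the changing stageId; (\[.*?\]) is lazy, so it stops at the
# first ']' that is immediately followed by ');'.
pattern = re.compile(
    r"DataStore\.prime\('standings',\s*\{\s*stageId:\s*\d+\s*\},\s*(\[.*?\])\);",
    re.S,
)

match = pattern.search(page)
standings = match.group(1) if match else None
print(standings)
```

One limitation of this sketch: if the data itself ever contains the sequence "]);", the lazy match would end there too early.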

gdogg371

3 Answers

3

Don't parse web pages with regular expressions. Use an HTML parser like Beautiful Soup.

Edit: to elaborate.

Regular expressions are used to recognize and manipulate regular languages. HTML is context-free and therefore cannot be correctly recognized or manipulated with regular expressions. Instead we use dedicated parsers to manipulate HTML. Beautiful Soup is one of the more popular Python HTML parsers.
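A minimal sketch of the parser-first idea, under some assumptions: it uses the standard library's HTML parser rather than Beautiful Soup so that it runs without extra installs, and the ScriptCollector class and the sample markup are hypothetical. The parser only isolates the contents of each script element; the data inside the DataStore.prime call is JavaScript, so it still has to be pulled out of the selected script text afterwards.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ScriptCollector(HTMLParser):
    """Collects the raw text of every <script> element (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_script:
            self.scripts[-1] += data


# Made-up markup standing in for response.body.
html = """<html><body>
<p>DataStore.prime is mentioned in prose here.</p>
<script>var a = 1;</script>
<script>DataStore.prime('standings', { stageId: 7794 }, [[Some sample stats here]]);</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Keep only the script blocks that contain the call of interest.
standings_scripts = [s for s in collector.scripts if "DataStore.prime('standings'" in s]
print(standings_scripts)
```

Restricting the search to script text means that mentions of DataStore.prime elsewhere in the page cannot produce a false match.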

Andrew Johnson
  • 3
    Enjoy the upvotes but this is more of a comment than an answer. Care to elaborate? – Tom Fenech Aug 05 '14 at 20:16
  • @andrewjohnson body = response.xpath('DataStore.prime\('standings', { stageId:').extract() throws up a syntax error around 'standings'? – gdogg371 Aug 05 '14 at 21:11
  • Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – ben Aug 06 '14 at 17:00
1

It seems like I've seen dozens of questions about this website in less than a month. Have you at least researched the available information? For example, there is a whole thesis about this website, explaining at length how to scrape it, extract information with betting prediction algorithms, etc.

http://www.diva-portal.org/smash/get/diva2:655630/FULLTEXT01.pdf

Here's an excerpt:

Using the regex matching technique described in Appendix C, we use the pattern "/Datastore.prime(’standings’, { stageId: ".$stageID."}, [([.*\n,?)+/" and find the source code for the table. An example of how this looks like is given in appendix C. The next step is to further extract each unique matchID from the table-source code. For this, a much less complicated pattern is sufficient, because we know that each match’s ID-tag is inside an HTML hyperlink, and each hyperlink uses the match-ID as an attribute. For example, the following may be a hyperlink contained in the line containing fixtures Arsenal are involved in:

Arthur Burkhardt
0

In case anyone is interested, what eventually resolved this was the following:

regex = re.compile(r"DataStore\.prime\('standings', { stageId: \d+ }, \[\[.*?\]\]?\)?;", re.S)

match2 = re.search(regex, response.body).group()
match3 = str(match2)
match3 = match3.replace('<a class="w h"', '').replace('<a class="w a"', '').replace('<a class="d h"', '') \
               .replace('<a class="d a"', '').replace('<a class="l h"', '').replace('<a class="l a"', '') \
               .replace('title=', '')
print match3

The problem was that the regex was matching multiple instances of the final ']' bracket and the ')'. By putting a '?' after each of them, the pattern now matches zero or one instance of those characters, which means only the data I want is scraped.
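A small self-contained check of this pattern, for anyone who wants to try it without a live page. This is a sketch under assumptions: the sample string and its team names are invented, the braces in the pattern are escaped as a precaution, and the six chained replace calls for the link classes are collapsed into one re.sub with character classes, which is a swapped-in equivalent rather than the code I actually ran.

```python
import re

# Invented stand-in for response.body; only the overall shape follows the real page.
sample = (
    "DataStore.prime('standings', { stageId: 7794 }, "
    "[[1,'<a class=\"w h\" title=\"Team A 2-0 Team B\"/>'],"
    "[2,'<a class=\"l a\" title=\"Team C 0-1 Team D\"/>']]);\n"
    "var later = [[not, wanted]];\n"
)

regex = re.compile(
    r"DataStore\.prime\('standings', \{ stageId: \d+ \}, \[\[.*?\]\]?\)?;",
    re.S,
)

found = regex.search(sample).group()

# [wdl] [ha] covers exactly "w h", "w a", "d h", "d a", "l h" and "l a".
cleaned = re.sub(r'<a class="[wdl] [ha]"', '', found).replace('title=', '')
print(cleaned)
```

The lazy .*? stops at the first ']' that can be followed by an optional ']', an optional ')' and then ';', so the later "var later" line is not included in the match.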

gdogg371