Regex throwing exception in python

Question

I'm writing a ticker to track Walmart out-of-stock and price changes, but I'm stuck: when I try to get the ID of the item (the trailing number in the link), I can't parse it. Here is the code:

# -*- coding: utf-8 -*-

import re
import urllib2

def walmart():
    fileprod = urllib2.urlopen("http://testh3x.altervista.org/walmart.txt").read()
    prods = fileprod.split("|")
    print prods
    lenp = len(prods)
    counter = 0
    while 1:
        while counter < lenp:
            data = urllib2.urlopen(prods[counter]).read()
            path = re.compile("class=\"Outofstock\"") # \s whitespace - \w word char - \W anything but a word char
            matching = path.match(data)
            if matching == None: 
                pass
            else:
                print "Out of stock"
            name = re.compile("\d") 
            m = name.match(str(prods[counter])).group # prods[counter] is the link
            print m


def main():
    walmart()

if __name__ == "__main__":
    main()

It throws:

  File "C:\Users\Leonardo\Desktop\BotDevelop\ticker.py", line 22, in walmart
    m = name.match(str(prods[counter])).group # prods[counter] is the link
AttributeError: 'NoneType' object has no attribute 'group'

You don't need to compile the regex on every loop iteration; you can do it once before the `while`. Also, you could rewrite `"class=\"Outofstock\""` with single outer quotes, `'class="Outofstock"'`, so you don't need to escape the double quotes — akaRem, Mar 15 '14 at 12:23
Just as a comment, parsing html with regex ain't a very good idea: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Paulo Bu, Mar 15 '14 at 12:24

score 3 · Answer 1 · answered Mar 15 '14 at 12:19

You should look into BeautifulSoup, which makes parsing HTML manageable and rather easy. Regexes usually don't cope well with real-world HTML.

To answer your question, though, your error comes from the fact that no matches were found. In general, it is better to run a regex like this:

m = name.match(str(prods[counter]))  # if no match is found, then None is returned
if m:
    m = m.group()  # be sure to call the method here
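
The same guard can be wrapped in a small helper. This is only a sketch: the function name `extract_id` and the pattern `\d+$` (digits at the end of the string) are assumptions made for illustration, and it uses `search` rather than `match` because `match` only looks at the start of the string.

```python
import re

# Assumed pattern: one or more digits at the very end of the string.
ID_PATTERN = re.compile(r"\d+$")

def extract_id(url):
    """Return the trailing numeric ID of url, or None if there is none."""
    m = ID_PATTERN.search(url)
    if m is None:
        # No match: return None instead of raising AttributeError on .group()
        return None
    return m.group()

print(extract_id("http://www.walmart.com/ip/Regalo-Easy-Open-Baby-Gate/7812821"))
print(extract_id("http://www.walmart.com/"))
```

The caller can then test the result with `if item_id is not None:` before using it.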

Martijn Pieters · Accepted Answer · 2014-03-15T12:31:41.550

Your regular expression didn't match. You are using re.match() instead of re.search(); the former only matches at the start of a string:

m = name.search(str(prods[counter])).group()
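
As a small illustration of the difference, the sketch below runs both methods on the product URL quoted in the comments. The pattern `\d+$` (one or more digits at the end of the string) is an assumption added here, since the original `\d` can only ever capture a single digit.

```python
import re

url = "http://www.walmart.com/ip/Regalo-Easy-Open-Baby-Gate/7812821"

single_digit = re.compile(r"\d")     # pattern from the question
trailing_id = re.compile(r"\d+$")    # assumed replacement: digits at the end

at_start = single_digit.match(url)              # None: the URL starts with "h"
first_digit = single_digit.search(url).group()  # "7": only one digit is captured
item_id = trailing_id.search(url).group()       # "7812821"

print(at_start, first_digit, item_id)
```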

You don't need to re-compile your regular expressions in the loop either; move those out of the loops and compile them just once.
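
A minimal sketch of compiling once and reusing the pattern follows. The helper `out_of_stock_urls` and the sample HTML strings are hypothetical stand-ins for downloaded pages, and `search` is used so the marker is found anywhere in the page rather than only at its start.

```python
import re

# Compiled once at import time and reused for every page.
OUT_OF_STOCK = re.compile(r'class="Outofstock"')

def out_of_stock_urls(pages):
    """pages maps URL -> HTML text; return the sorted URLs flagged out of stock."""
    return sorted(url for url, html in pages.items() if OUT_OF_STOCK.search(html))

# Hypothetical, hand-written snippets standing in for downloaded pages.
sample_pages = {
    "http://example.com/ip/a/1": '<div class="Outofstock">Out of stock</div>',
    "http://example.com/ip/b/2": '<span class="camelPrice">$32.98</span>',
}
print(out_of_stock_urls(sample_pages))
```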

You really should not be using regular expressions to parse HTML, when there are better tools available. Use BeautifulSoup instead.

You should also just loop over prods directly; there is no need for a while loop there:

import urllib2
from bs4 import BeautifulSoup

fileprod = urllib2.urlopen("http://testh3x.altervista.org/walmart.txt").read()
prods = fileprod.split("|")

for prod in prods:
    # split off last part of the URL for the product code
    product_code = prod.rsplit('/', 1)[-1]

    data = urllib2.urlopen(prod).read()
    soup = BeautifulSoup(data, 'html.parser')
    if soup.find(class_='Outofstock'):
        print product_code, 'out of stock!'
        continue

    price = soup.find('span', class_='camelPrice').text
    print product_code, price

For your starter URL, that prints:

7812821 $32.98
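
The splitting step can be tried on its own without any network access. In this sketch the helper name `product_code`, the second URL, and the extra `strip()`/`rstrip("/")` calls are assumptions, added to cope with a trailing newline or slash that a `|`-separated file might contain; query strings are not handled.

```python
def product_code(url):
    """Return the last path segment of url (a sketch; ignores query strings)."""
    return url.strip().rstrip("/").rsplit("/", 1)[-1]

# First URL is the one from the question; the second is made up for illustration.
raw = ("http://www.walmart.com/ip/Regalo-Easy-Open-Baby-Gate/7812821"
       "|http://www.walmart.com/ip/Some-Item/123/\n")

codes = [product_code(u) for u in raw.split("|")]
print(codes)
```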

I'm parsing a link like this: http://www.walmart.com/ip/Regalo-Easy-Open-Baby-Gate/7812821 and I want to get the final number... — user3423076, Mar 15 '14 at 12:26
@user3423076: yes, I see what you were trying to do with that parsing line. Splitting text is much easier. — Martijn Pieters, Mar 15 '14 at 12:28
