Python Regex for parsing site

Question

I am trying to write a Python script to pull data from a site and place it into a JSON string.

The site is http://mtc.sri.com/live_data/attackers/.

I have Python pulling the source, but I can't quite figure out the regex portion.

When I use RegExr, this regex works:

But when I put it into the script, I get no match.

#!/usr/bin/python
import urllib2
import re

f = urllib2.urlopen("http://mtc.sri.com/live_data/attackers/")
out = f.read();

matchObj = re.match( r'</?table[^>]*>|</?tr[^>]*>|</?td[^>]*>|</?thead[^>]*>|</?tbody[^>]*>|</?font[^>]*>', out, re.M|re.I)

if matchObj:
   print "matchObj.group() : ", matchObj.group()
   print "matchObj.group(1) : ", matchObj.group(1)
   print "matchObj.group(2) : ", matchObj.group(2)
else:
   print "No match!!"

Any idea why I am not getting the appropriate response?

Edit:

Per a suggestion below, I used:

matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)

for i in matchObj.pop():
    print i

However, this simply outputs:

<
/
t
a
b
l
e
>

Edit 2:

I was using .pop() on the matchObj for some reason. Took that off. Now I am getting a lot more output, but I am only getting the tags, not the data inside. In fact, I do not care about the tags; I would prefer just the data.

matchObj = re.findall( r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', out, re.M|re.I)

for i in matchObj:
    print i

Output:

<table class="attackers">
<tr>
</tr>
<tr>
<td>
</td>
<td>
</td>
...

I've never used it, but I know most people on here recommend Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ — bozdoz, Sep 27 '13 at 18:34
Unfortunately, beautiful soup is not installed on the servers this script is being used for, so we need to use regex. — Sugitime, Sep 27 '13 at 18:35
Trying to parse HTML with Regex is fraught with difficulty. See [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — , Sep 27 '13 at 18:36

Jerry · Accepted Answer · 2013-09-27T18:44:41.170

3

re.match only tries to match at the beginning of the string; it does not scan ahead for a match further in.

Return None if the string does not match the pattern; note that this is different from a zero-length match.

Use re.search instead.

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
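As a rough illustration of the difference, here is a minimal sketch. The `html` string is a made-up stand-in for the downloaded page, not the real site's markup, and the pattern is a compact equivalent of the one in the question. The single-argument `print(...)` calls run on Python 3 and also behave the same on Python 2.

```python
import re

# Hypothetical stand-in for the page source returned by f.read().
html = '<html><body><table class="attackers"><tr><td>1.2.3.4</td></tr></table></body></html>'

# Compact equivalent of the question's alternation of tag names.
pattern = r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>'

# re.match only tries position 0, where the string starts with "<html>".
at_start = re.match(pattern, html, re.I)

# re.search scans forward until the pattern matches somewhere.
anywhere = re.search(pattern, html, re.I)

print(at_start)          # None
print(anywhere.group())  # <table class="attackers">
```

Since a real page almost never begins with a table tag, `re.match` returns `None` there, which is why the script printed "No match!!".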

I think that you can also shorten your regex a bit:

</?(?:table|t[dr]|thead|tbody|font)[^>]*>

Also, because your regex contains no capture groups, only group() (that is, group(0)) is available; group(1) and group(2) will raise an IndexError. That single match is the first tag found in the string.

If you want all of the matches, use re.findall, which returns a list of the matched strings.
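A small sketch of what re.findall returns for a pattern without capture groups, again on an invented `html` string rather than the real page:

```python
import re

# Hypothetical sample markup, not the actual site content.
html = '<table class="attackers"><tr><td>1.2.3.4</td><td>42</td></tr></table>'

# With no capture groups, findall returns each whole match as a string.
tags = re.findall(r'</?(?:table|t[dr]|thead|tbody|font)[^>]*>', html, re.I)

# Iterate over the list itself; iterating over one element (for example
# tags.pop()) would walk through that string one character at a time.
for tag in tags:
    print(tag)
```

Here `tags` holds eight strings, from `<table class="attackers">` through `</table>`, and none of the text between the tags.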

edited Sep 27 '13 at 18:44

answered Sep 27 '13 at 18:37

Another way of looking at this is that `re.match` is essentially the same as `re.search` except that there is an implicit `^` at the beginning of the regular expression so that it must match at the beginning of the string. – amitparikh Sep 27 '13 at 18:41
I updated the description of the issue with what happens when I use re.findall. re.search still does not return anything. – Sugitime Sep 27 '13 at 21:02
@Sugitime Why are you using `.pop()`? Using this will return the last element of the list and `for i in element:` will break this element into each of its characters. Use `for i in matchObj:` – Jerry Sep 27 '13 at 21:08
I just saw that. I am not sure why I was using .pop(). So I took it off. Question will be updated in a moment. – Sugitime Sep 27 '13 at 21:10
@Sugitime This is a different question :) Your initial regex finds all the tags, but what do you want to get actually? The table with all the tags? If that's so, try changing your regex to: `<table[^>]*>.*?</table>`. – Jerry Sep 27 '13 at 21:16
I actually was able to get the data I wanted from: `<td[^>]*?>(.*?)<\/td>`. Thank you for your help Jerry. Answer accepted :) – Sugitime Sep 27 '13 at 21:19
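
For anyone landing here later, a minimal sketch of how a capturing pattern along those lines could feed a JSON string. It assumes the cell pattern is `<td[^>]*?>(.*?)</td>`; the `html` sample, its column values, and the helper names `row_re` and `cell_re` are invented for illustration, since the real page layout is not shown in this thread.

```python
import json
import re

# Hypothetical sample; the real page's columns and values may differ.
html = """
<table class="attackers">
<tr><td>1.2.3.4</td><td>US</td></tr>
<tr><td>5.6.7.8</td><td>DE</td></tr>
</table>
"""

# re.S lets "." cross newlines; the lazy ".*?" stops at the nearest closing tag.
row_re = re.compile(r'<tr[^>]*>(.*?)</tr>', re.I | re.S)
cell_re = re.compile(r'<td[^>]*?>(.*?)</td>', re.I | re.S)

# With one capture group, findall returns only the captured text.
rows = [cell_re.findall(row) for row in row_re.findall(html)]

as_json = json.dumps(rows)
print(as_json)  # [["1.2.3.4", "US"], ["5.6.7.8", "DE"]]
```

This only holds up on simple, regular markup; nested tables or tags inside cells would need a real HTML parser.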
