Extracting a table from a webpage with regex

Question

I want to extract the table containing the IP blocks from this site.

Looking at the HTML source I can clearly see that the area I want is structured like this:

[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]

So I wrote this little snippet:

import urllib2,re
from lxml import html
response = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html')

content = response.read()

print re.match(r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)",content)

The contents of the page are fetched correctly, without problems. However, the regex match always returns None (the print here is just for debugging).

Considering the structure of the page, I can't understand why there isn't a match. I would expect there to be three groups with the second group being the table contents.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — sshashank124, Nov 08 '14 at 20:17
@sshashank124 Yes, I need to demonstrate regex to extract the table and I can't figure out why it doesn't work with this large string when I can get regex to work on other strings. — Juicy, Nov 08 '14 at 20:19

score 2 · Accepted Answer · answered Nov 08 '14 at 20:21

By default, . does not match newlines. You need to specify the dot-all flag to have it do this:

re.match(..., content, re.DOTALL)

Below is a demonstration:

>>> import re
>>> content = '''
... [CONTENT BEFORE TABLE]
... <table border="1" cellpadding="6" bordercolor="#000000">
... [IP ADDRESSES AND OTHER INFO]
... </table>
... [CONTENT AFTER TABLE]
... '''
>>> pat = r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)"
>>> re.match(pat, content, re.DOTALL)
<_sre.SRE_Match object at 0x02520520>
>>> re.match(pat, content, re.DOTALL).group(2)
'\n[IP ADDRESSES AND OTHER INFO]\n'
>>>

The dot-all flag can also be activated by using re.S or by placing (?s) at the start of your pattern.
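
As a rough sketch of that equivalence, the snippet below runs the same pattern with re.DOTALL, with re.S, and with an inline (?s) prefix, and once more with no flag for contrast. The content string is a stand-in for the fetched page, and the names TABLE, bodies, and without_flag are made up for illustration; single-argument print(...) is used so the snippet should behave the same on Python 2 and Python 3.

```python
import re

# Stand-in for the fetched page (an assumption; the real page is much larger)
content = """
[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]
"""

TABLE = r'(.*)<table border="1" cellpadding="6" bordercolor="#000000">(.*)</table>(.*)'

# Three ways of switching on dot-all mode
with_dotall = re.match(TABLE, content, re.DOTALL)
with_s = re.match(TABLE, content, re.S)
with_inline = re.match("(?s)" + TABLE, content)

# No flag: "." stops at the first newline, so the match fails
without_flag = re.match(TABLE, content)

bodies = [m.group(2) for m in (with_dotall, with_s, with_inline)]
print(bodies)        # three identical table bodies
print(without_flag)  # None
```

The inline form can be handy when the pattern is passed to code that does not expose a flags argument.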

Thanks! Didn't know about DOTALL. – Juicy, Nov 08 '14 at 20:22

score 1 · Answer 2 · Hackaholic · 2014-11-08T20:43:23.983

For parsing HTML, I would prefer BeautifulSoup:

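from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html').read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, 'html.parser')
# Select the table by the same attributes the regex was matching on
for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    print x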

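For cleaner output, print each table row on its own line with the fields separated by tabs:

for table in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    # find_all('tr') returns only row tags, so no try/except is needed to skip text nodes
    for row in table.find_all('tr'):
        print "\t".join(row.get_text().split())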

edited Nov 08 '14 at 20:43

answered Nov 08 '14 at 20:34

Thanks, I needed regex, but I will look into Beautiful Soup; it looks neat. – Juicy, Nov 08 '14 at 20:42

@Juicy BeautifulSoup is a great utility for parsing an HTML page. – Hackaholic, Nov 08 '14 at 20:49
