
I want to extract the table containing the IP blocks from this site.

Looking at the HTML source I can clearly see that the area I want is structured like this:

[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]

So I wrote this little snippet:

import urllib2,re
from lxml import html
response = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html')

content = response.read()

print re.match(r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)",content)

The contents of the page are fetched correctly, without problems. However, the regex match always returns None (the print here is just for debugging).

Considering the structure of the page, I can't understand why there isn't a match. I would expect there to be three groups with the second group being the table contents.

Brian Tompsett - 汤莱恩
Juicy

2 Answers


By default, . does not match newlines. You need to specify the dot-all flag to have it do this:

re.match(..., content, re.DOTALL)

Below is a demonstration:

>>> import re
>>> content = '''
... [CONTENT BEFORE TABLE]
... <table border="1" cellpadding="6" bordercolor="#000000">
... [IP ADDRESSES AND OTHER INFO]
... </table>
... [CONTENT AFTER TABLE]
... '''
>>> pat = r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)"
>>> re.match(pat, content, re.DOTALL)
<_sre.SRE_Match object at 0x02520520>
>>> re.match(pat, content, re.DOTALL).group(2)
'\n[IP ADDRESSES AND OTHER INFO]\n'
>>>

The dot-all flag can also be activated by using re.S or by placing (?s) at the start of your pattern.
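
As a quick check that these spellings behave the same, here is a minimal sketch that runs one pattern four ways. The sample string and the shortened pattern are made up for illustration; they are not the real page.

```python
import re

# Hypothetical stand-in for the downloaded page (not the real content)
sample = "before\n<table border=\"1\">\nrow data\n</table>\nafter\n"
pattern = r"(.*)<table border=\"1\">(.*)</table>(.*)"

no_flag = re.match(pattern, sample)                 # '.' stops at the first newline
with_dotall = re.match(pattern, sample, re.DOTALL)  # '.' now matches newlines too
with_s = re.match(pattern, sample, re.S)            # re.S is a short alias of re.DOTALL
inline = re.match("(?s)" + pattern, sample)         # inline flag at the start of the pattern

print(no_flag)
print(repr(with_dotall.group(2)))
print(with_dotall.group(2) == with_s.group(2) == inline.group(2))
```

Without the flag the match fails because the first (.*) cannot cross the newline that precedes the table tag; the three dot-all variants all capture the same second group.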


For parsing HTML I would prefer BeautifulSoup over a regex:

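from bs4 import BeautifulSoup
import urllib2

# Download the page; naming the parser explicitly avoids bs4's "no parser specified" warning
page = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html').read()
soup = BeautifulSoup(page, 'html.parser')
# Select the table by the same attributes seen in the page source
for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    print x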

For cleaner output, print each row on one line with its cells separated by tabs:

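for x in soup.find_all('table', attrs={'border':"1",'cellpadding':"6",'bordercolor':"#000000"}):
    # Iterate over the table's direct children; some are bare text nodes, not tags
    for y in x:
        try:
            if y.name == 'tr':
                # Collapse the row's text into tab-separated fields
                print "\t".join(y.get_text().split())
        except AttributeError:
            # Skip children that do not expose a tag name
            pass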
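
If installing BeautifulSoup is not an option, the same parse-instead-of-regex idea can be sketched with the standard library alone. This is a swapped-in technique, not the code above: it assumes Python 3 and uses html.parser.HTMLParser. The TableRows class and the sample markup are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 standard library


class TableRows(HTMLParser):
    """Hypothetical helper: collect the cell text of every <tr> as a list."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._in_cell = False
        elif tag in ("td", "th") and self._row is not None:
            self._row.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a cell of an open row
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


# Made-up sample markup, not the real page
sample = """
<table border="1" cellpadding="6" bordercolor="#000000">
<tr><th>From IP</th><th>To IP</th></tr>
<tr><td>192.0.2.0</td><td>192.0.2.255</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
for row in parser.rows:
    print("\t".join(row))
```

This sketch collects rows from every table in the input and ignores nesting, so on a page with several tables it would need an extra check on the table attributes.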
Hackaholic