I want to extract the table containing the IP blocks from this site.
Looking at the HTML source, I can clearly see that the area I want is structured like this:
[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]
So I wrote this little snippet:
import urllib2,re
from lxml import html
response = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html')
content = response.read()
print re.match(r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)",content)
The content of the page is fetched correctly, without problems. However, the regex match always returns None (the print here is just for debugging).
Considering the structure of the page, I can't understand why there isn't a match. I would expect there to be three groups with the second group being the table contents.
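For reference, here is a minimal offline reproduction of the behaviour. The `sample` string is a hypothetical stand-in for the fetched page (not its real content), and it assumes the real page spans several lines. Under that assumption, a likely explanation is that `.` does not match newlines by default, so the pattern cannot span lines unless `re.DOTALL` is passed:

```python
import re

# Hypothetical stand-in for the fetched page; the real content is not reproduced here.
sample = (
    "<html><body>\n"
    "<p>before</p>\n"
    '<table border="1" cellpadding="6" bordercolor="#000000">\n'
    "<tr><td>192.0.2.0</td><td>192.0.2.255</td></tr>\n"
    "</table>\n"
    "<p>after</p>\n"
    "</body></html>\n"
)

# Same pattern as in the snippet above, written with single quotes to avoid escaping.
pattern = r'(.*)<table border="1" cellpadding="6" bordercolor="#000000">(.*)</table>(.*)'

# Default flags: "." does not match "\n", so the pattern cannot span lines.
no_flags = re.match(pattern, sample)

# With re.DOTALL, "." also matches newlines, so all three groups can be captured.
with_dotall = re.match(pattern, sample, re.DOTALL)

print(no_flags)
table_body = with_dotall.group(2) if with_dotall else None
print(table_body)
```

If this reflects what happens with the real page, the second group would hold the table contents. An HTML parser such as the already imported lxml is probably a more robust choice than a regex for this kind of extraction.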