I've already googled my question but haven't found a solution yet. I have the following HTML content as a string and want to extract the IPs and ports from it.

I've read about Beautiful Soup and regular expressions and tried both, but I can't get either to work, and Beautiful Soup is very slow. Sorry for my bad English.

<tr class="proxyListOdd">
<td><a href="http://whois.sc/81.196.122.86" target="_blank">81.196.122.86</a></td>
<td>8080</td>
<td>Nein</td>
<td>3</td>
<td class="proxyList_Ping" >0.44 Sek.</td>
<td><img height="24px" width="24px" alt="Rumänien" title="Rumänien" src="http://static2.proxy-listen.de/0_proxy/images/flags/ro.png"></td>
<td class="proxyList_Online arrowUp">97% </td>
<td>22:06</td>
<td><input style="align: center" title="Proxyserver übernehmen" type="image" src="/0_proxy/images/ProxyswitcherButtonOn.png" onclick="de.proxy_listen.setProxy({'U2a66iQA': '70ODEuMTk2LjEyMi44Ng==', 'uhSRlFfS': '96ODA4MA==', 'h0zMxtxH':'21MQ=='}, 'https://addons.mozilla.org/addon/proxy-listen-de_proxyswitcher/');"></td>
<td><a href='proxy:name=Proxy-listen.de&host=81.196.122.86&port=8080&foxyProxyMode=this&confirmation=popup' title="Proxyserver in FoxyProxy übernehmen."><img height="24px" width="22px" alt="FoxyProxy" src="http://static.proxy-listen.de/0_proxy/images/foxyproxy.png"></a></td>
</tr>
<tr class="proxyListEven">
<td><a href="http://whois.sc/94.126.17.68" target="_blank">94.126.17.68</a></td>
<td>3128</td>
<td>Nein</td>
<td>3</td>
<td class="proxyList_Ping" >0.95 Sek.</td>
<td><img height="24px" width="24px" alt="Schweiz" title="Schweiz" src="http://static2.proxy-listen.de/0_proxy/images/flags/ch.png"></td>
<td class="proxyList_Online arrowUp">86% </td>
<td>22:06</td>
<td><input style="align: center" title="Proxyserver übernehmen" type="image" src="/0_proxy/images/ProxyswitcherButtonOn.png" onclick="de.proxy_listen.setProxy({'U2a66iQA': '65OTQuMTI2LjE3LjY4', 'uhSRlFfS': '78MzEyOA==', 'h0zMxtxH':'52MQ=='}, 'https://addons.mozilla.org/addon/proxy-listen-de_proxyswitcher/');"></td>
<td><a href='proxy:name=Proxy-listen.de&host=94.126.17.68&port=3128&foxyProxyMode=this&confirmation=popup' title="Proxyserver in FoxyProxy übernehmen."><img height="24px" width="22px" alt="FoxyProxy" src="http://static.proxy-listen.de/0_proxy/images/foxyproxy.png"></a></td>
</tr>
<tr class="proxyListOdd">
<td><a href="http://whois.sc/89.105.247.13" target="_blank">89.105.247.13</a></td>
<td>3128</td>
<td>Nein</td>

I hope you can help me ;) Kind regards, henry

Ashwini Chaudhary
wayne

4 Answers

Use a regular expression:

>>> import re
>>> set(m.group(0) for m in re.finditer(r'([0-9]{1,3}\.){3}[0-9]{1,3}', s))
{'81.196.122.86', '94.126.17.68', '89.105.247.13'}

Note that this regular expression is simplified, and doesn't actually capture all IP addresses (and captures some values that are not). If you want a more precise match, according to inet_addr(3) and RFC 4291, the whole regular expression looks like:

# IPv4, common format
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|
# IPv4, dotted hexadecimal
(?:0x[0-9a-fA-F]{2}\.){3}0x[0-9a-fA-F]{2}|
# IPv4, dotted octal
(?:0[0-7]{3}\.){3}0[0-7]{3}|
# IPv4, one number, hexadecimal
0x[0-9a-fA-F]{1,8}|
# IPv4, one number, octal
0[0-7]{1,11}|
# IPv4, one number, decimal
[1-4][0-9]{9}|0|[1-9][0-9]{0,8}|
# IPv6, preferred form (RFC 4291 2.2.1)
(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|
# IPv6, compressed syntax (RFC 4291 2.2.2)
(?:
  [0-9a-fA-F]{0,4}::(?:[0-9a-fA-F]{1,4}:){,6}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){1}::(?:[0-9a-fA-F]{1,4}:){,4}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){2}::(?:[0-9a-fA-F]{1,4}:){,3}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){3}::(?:[0-9a-fA-F]{1,4}:){,2}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){4}::(?:[0-9a-fA-F]{1,4}:){,1}[0-9a-fA-F]{0,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){5}::[0-9a-fA-F]{0,4}
)|
# IPv6, alternative form (RFC 4291 2.2.3, uncompressed)
(?:[0-9a-fA-F]{1,4}:){6}(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|
# IPv6, alternative form (RFC 4291 2.2.3, compressed)
(?:
  [0-9a-fA-F]{0,4}::(?:[0-9a-fA-F]{1,4}:){,4}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){1}::(?:[0-9a-fA-F]{1,4}:){,3}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){2}::(?:[0-9a-fA-F]{1,4}:){,2}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){3}::(?:[0-9a-fA-F]{1,4}:){,1}|
  [0-9a-fA-F]{0,4}(?::[0-9a-fA-F]{1,4}){4}::
)
(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)

As you can see, if you really want to match all IP addresses, you should search for the approximate format, and then (if necessary) validate the addresses, for example with ipaddress. Note that the above regular expression is incomplete for your case as it does not cover possible HTML character encodings such as &#x31; for 1.
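
The "approximate match, then validate" idea could be sketched roughly as follows for IPv4. This is only a sketch: the `CANDIDATE` pattern, the `find_valid_ipv4` helper and the sample string are made up for illustration, and HTML character references are not decoded.

```python
import ipaddress
import re

# Loose candidate pattern (an assumption for this sketch): four dot-separated
# runs of 1-3 digits that are not part of a longer digit/dot sequence.
CANDIDATE = re.compile(r'(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])')

def find_valid_ipv4(text):
    """Return the distinct, syntactically valid IPv4 addresses in text, sorted."""
    valid = set()
    for match in CANDIDATE.finditer(text):
        try:
            valid.add(ipaddress.IPv4Address(match.group(0)))
        except ipaddress.AddressValueError:
            continue  # e.g. an octet above 255
    return [str(addr) for addr in sorted(valid)]

# Hypothetical sample text, loosely modelled on the question's markup.
sample = '<td>81.196.122.86</td> <td>999.1.1.1</td> host=94.126.17.68&port=3128'
print(find_valid_ipv4(sample))
```

Letting `ipaddress` decide what counts as valid keeps the regular expression short; the cost is one extra pass over the candidates.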

phihag

This works only with IPv4:

re.findall(r'(\d+\.\d+\.\d+\.\d+)&port=(\d+)', s)
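
A small usage sketch of this approach, run against a shortened copy of the FoxyProxy links from the question. The shortened string `s` and the optional `amp;` in the pattern (for markup that escapes `&` as `&amp;`) are assumptions of this example, not part of the original one-liner.

```python
import re

# Shortened copy of two FoxyProxy links from the question's markup.
s = ("<td><a href='proxy:name=Proxy-listen.de&host=81.196.122.86&port=8080"
     "&foxyProxyMode=this&confirmation=popup'></a></td>"
     "<td><a href='proxy:name=Proxy-listen.de&host=94.126.17.68&port=3128"
     "&foxyProxyMode=this&confirmation=popup'></a></td>")

# Each match yields an (ip, port) tuple of strings.
pairs = re.findall(r'host=(\d+\.\d+\.\d+\.\d+)&(?:amp;)?port=(\d+)', s)

# Join them into the usual "ip:port" notation.
proxies = ['{}:{}'.format(ip, port) for ip, port in pairs]
print(proxies)
```

Because each proxy link occurs once per table row, every proxy should be reported exactly once, unlike a bare IP search, which also hits the whois link and the cell text.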
Marco de Wit

See Similar question

EDIT: For this particular case, you have to do something different and extract the data with a regex tailored to this particular HTML (as each IP appears multiple times):

print([":".join((ip, port)) for ip, port in re.findall(r'proxyList(?:Even|Odd).*?_blank">(.*?)</a></td>.*?<td>([0-9]+)</td>', data, flags=re.DOTALL)])

You could also match on the 'proxy:name=Proxy-listen' part instead, as Marco de Wit's answer does.

Otherwise:

re.findall(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)', data)

This finds all IPv4 addresses. To capture ports as well, modify it to:

re.findall(r'((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)):([0-9]{1,5})', data)

This should find all IPs and ports in the format XXX.XXX.XXX.XXX:YYYYY. That said, it doesn't check whether the ports are valid.
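
If the port range matters, a check can be added after matching. The following is a minimal sketch; the names `OCTET`, `IP_PORT` and `find_proxies` and the test string are invented for this example.

```python
import re

# One IPv4 octet (0-255), same alternation as in the expressions above.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
# "ip:port" with the IP in group 1 and the port digits in group 2.
IP_PORT = re.compile(r'\b((?:' + OCTET + r'\.){3}' + OCTET + r'):([0-9]{1,5})\b')

def find_proxies(text):
    """Return (ip, port) tuples whose port lies in the range 1-65535."""
    result = []
    for ip, port in IP_PORT.findall(text):
        if 0 < int(port) <= 65535:
            result.append((ip, int(port)))
    return result

# Hypothetical input; the last entry has an out-of-range port.
print(find_proxies('81.196.122.86:8080 94.126.17.68:3128 10.0.0.1:70000'))
```

Doing the range check in Python rather than in the pattern keeps the regular expression readable.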

Trickfire

Have you considered using something like minidom? From the documentation:

xml.dom.minidom is a light-weight implementation of the Document Object Model interface. It is intended to be simpler than the full DOM and also significantly smaller.
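
One caveat: minidom expects well-formed XML, and the rows in the question contain unclosed `<img>` and `<input>` tags and bare `&` characters in attributes, so parsing them with minidom would likely fail. As a rough alternative sketch that stays in the standard library, the following uses `html.parser` instead (a swapped-in module, not minidom). The class name `ProxyRowParser` and the shortened sample rows are assumptions of this example.

```python
from html.parser import HTMLParser

class ProxyRowParser(HTMLParser):
    """Collect (ip, port) from the first two <td> cells of each proxyList row."""

    def __init__(self):
        super().__init__()
        self.proxies = []
        self._cells = None    # cell texts of the current row, or None outside a row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class') or ''
        if tag == 'tr' and css_class.startswith('proxyList'):
            self._cells = []
        elif tag == 'td' and self._cells is not None:
            self._in_td = True
            self._cells.append('')

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._cells is not None:
            if len(self._cells) >= 2:
                self.proxies.append((self._cells[0].strip(), self._cells[1].strip()))
            self._cells = None
            self._in_td = False

# Shortened rows modelled on the question's markup (attribute values abbreviated).
html = '''<tr class="proxyListOdd">
<td><a href="http://whois.sc/81.196.122.86" target="_blank">81.196.122.86</a></td>
<td>8080</td>
<td>Nein</td>
<td><img height="24px" alt="flag" src="ro.png"></td>
</tr>
<tr class="proxyListEven">
<td><a href="http://whois.sc/94.126.17.68" target="_blank">94.126.17.68</a></td>
<td>3128</td>
</tr>'''

parser = ProxyRowParser()
parser.feed(html)
parser.close()
print(parser.proxies)
```

Note that a row is only recorded when its closing `</tr>` is seen, so a truncated final row (as in the question's excerpt) would be skipped by this sketch.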

anonymous