I have a webpage whose text I fetch using the resources module in Python. What I can't figure out is how to find a pattern of numbers like 126.23.73.34 in the document and extract it using the re module.
0
-
If you want to extract an IP, this could help: http://stackoverflow.com/questions/2890896/extract-ip-address-from-an-html-string-python – Kumar Vikramjeet May 03 '13 at 10:48
3 Answers
3
You can use this regex for IPs: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
import re

text = "126.23.73.34"
match = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)
if match:
    # group(0) is the whole matched string
    print("match.group(0):", match.group(0))
If you are looking for a complete regex to get IPv4 addresses you can find the most appropriate regex here.
To restrict all 4 numbers in the IP address to 0-255, you can use this one taken from the source above:
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
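One practical wrinkle with this pattern: each octet sits in its own capturing group, so re.findall returns tuples of octets rather than whole addresses. Below is a minimal sketch, assuming Python 3 and a made-up sample string, that uses re.finditer with group(0) to recover the full match. The \b anchors are an added assumption, not part of the pattern as quoted above.

```python
import re

# One octet restricted to 0-255, as in the pattern above.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Four octets joined by literal dots; \b keeps a match from starting or
# ending in the middle of a longer run of digits (an added assumption).
IPV4 = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

# Hypothetical sample text, for illustration only.
sample = "gateway 192.168.0.1, host 126.23.73.34, bogus 555.555.555.555"

# findall yields one tuple of octets per match because of the groups.
as_tuples = IPV4.findall(sample)
# finditer gives match objects; group(0) is the whole address.
as_strings = [m.group(0) for m in IPV4.finditer(sample)]

print(as_tuples)
print(as_strings)
```

If only whole addresses are ever needed, switching the groups to non-capturing (?:...) lets re.findall return strings directly.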

eandersson
-
This is the correct regex for IPv4 addresses btw: `\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b` – tamasgal May 03 '13 at 10:57
-
Yep. I wasn't sure if he was looking for an IP, but I assumed as much. I'll include a link as a reference. – eandersson May 03 '13 at 10:58
-
I'm not sure the format of your answer result is what the OP would want, see: C:\wamp\www>Example.py ('192', '168', '0', '1') ('192', '168', '0', '254') – rcbevans May 03 '13 at 11:31
-
1
1
If it is HTML text, you could use an HTML parser (such as BeautifulSoup) to parse it, a regex to select strings that look like an IP, and the socket module to validate the IPs:
import re
import socket
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def isvalid(addr):
    try:
        socket.inet_aton(addr)
    except socket.error:
        return False
    else:
        return True

# webpage holds the HTML text fetched earlier
soup = BeautifulSoup(webpage, "html.parser")
ipre = re.compile(r"\b\d+(?:\.\d+){3}\b")  # matches some ips and more
ip_addresses = [ip for ips in map(ipre.findall, soup(text=ipre))
                for ip in ips if isvalid(ip)]
Note: it extracts IPs only from text nodes; e.g., it ignores IPs in HTML attributes.
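The same "select candidates with a loose regex, then validate" idea can also be sketched with only the standard library. This assumes Python 3, where the ipaddress module is available; find_ipv4 is a hypothetical helper name and the sample string is made up. Note that ipaddress is stricter than socket.inet_aton, which on many platforms also accepts shorthand and non-decimal forms.

```python
import ipaddress
import re

# Loose candidate pattern: four dot-separated runs of digits.
CANDIDATE = re.compile(r"\b\d+(?:\.\d+){3}\b")

def find_ipv4(text):
    """Return the valid IPv4 addresses found in text, in order (hypothetical helper)."""
    found = []
    for candidate in CANDIDATE.findall(text):
        try:
            # Raises a ValueError subclass if any octet is out of range.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        found.append(candidate)
    return found

# Hypothetical sample text, for illustration only.
sample = "ok 10.0.0.5 and 192.168.0.254, not 555.1.1.1 or 1.2.3"
print(find_ipv4(sample))
```

Like the BeautifulSoup version, this only sees whatever text it is handed, so HTML would still need to be parsed or stripped first.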

jfs
-
@Sazid: It is a library that you can use to extract info from HTML text. I've added link to its docs – jfs May 04 '13 at 12:52
0
You can use this. It will only accept VALID IP addresses:
import re

pattern = r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
text = "192.168.0.1 my other IP is 192.168.0.254 but this one isn't a real ip 555.555.555.555"
m = re.findall(pattern, text)
for i in m:
    print(i)
OUTPUT:
C:\wamp\www>Example.py
192.168.0.1
192.168.0.254
--Tested and working

rcbevans
-
Sure that works, but what if it is not a valid IP? e.g. 555.168.0.1? – eandersson May 03 '13 at 11:04
-
The question is, and I quote, "get a pattern of numbers like 126.23.73.34 from the document and extract it" It didn't say anything about actually validating the extracted values – rcbevans May 03 '13 at 11:08
-
That doesn't mean that other people won't look at this question a month, or year from now. It is always in the best interest of the community to provide the most complete answer possible. – eandersson May 03 '13 at 11:14