0

I have a webpage whose text I fetch using the resources module in Python. How can I find a pattern of numbers like 126.23.73.34 in the document and extract it using the re module?

user229044
Sazid
  • 1
    if you want to extract IP, this could help -> http://stackoverflow.com/questions/2890896/extract-ip-address-from-an-html-string-python – Kumar Vikramjeet May 03 '13 at 10:48

3 Answers

3

You can use this regex for IPs: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

import re

text = "126.23.73.34"
match = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)
if match:
    # group(0) is the whole matched string; the pattern has no capture groups
    print(match.group(0))

If you are looking for a complete regex to get IPv4 addresses you can find the most appropriate regex here.

To restrict all 4 numbers in the IP address to 0-255, you can use this one taken from the source above:

(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
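
As the comments on this answer point out, the parentheses in this pattern are capturing groups, so re.findall returns tuples of octets rather than whole addresses. A minimal sketch of the difference, assuming the word-boundary (\b) variant suggested in the comments; the name OCTET and the sample text are only for illustration:

```python
import re

# Same octet alternation as in the pattern above; OCTET is an illustrative name.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
pattern = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

text = "gateway 192.168.0.1, broadcast 192.168.0.255, bogus 555.555.555.555"

# With capturing groups, findall returns one tuple of octets per match.
octets = pattern.findall(text)

# finditer yields match objects, so group(0) gives the whole address.
addresses = [m.group(0) for m in pattern.finditer(text)]

print(octets)
print(addresses)
```

If only the full addresses are needed, the groups can also be made non-capturing with (?:...), in which case findall returns plain strings.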
eandersson
  • This is the correct regex for IPv4 addresses btw: `\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b` – tamasgal May 03 '13 at 10:57
  • Yep. I wasn't sure if he was looking for an IP, but I assumed as much. I'll include a link as a reference. – eandersson May 03 '13 at 10:58
  • I'm not sure the format of your answer result is what the OP would want, see: C:\wamp\www>Example.py ('192', '168', '0', '1') ('192', '168', '0', '254') – rcbevans May 03 '13 at 11:31
  • 1
    @o0rebelious0o try `print match.group()` – eandersson May 03 '13 at 11:45
1

If it is HTML text, you could use an HTML parser (such as BeautifulSoup) to parse it, a regex to select strings that look like an IP, and the socket module to validate the IPs:

import re
import socket
from bs4 import BeautifulSoup # pip install beautifulsoup4

def isvalid(addr):
    try:
        socket.inet_aton(addr)
    except socket.error:
        return False
    else:
        return True

soup = BeautifulSoup(webpage, "html.parser") # webpage: the HTML text fetched earlier
ipre = re.compile(r"\b\d+(?:\.\d+){3}\b") # matches some ips and more
ip_addresses = [ip for ips in map(ipre.findall, soup(text=ipre))
                for ip in ips if isvalid(ip)]

Note: this extracts IPs only from text nodes; e.g., it ignores IPs in HTML attributes.
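
As an alternative to socket.inet_aton for the validation step, the standard-library ipaddress module could be used. This is a swapped-in technique, not part of the code above: a minimal sketch assuming Python 3.3 or later, with is_valid_ipv4 and the sample string as hypothetical names for illustration:

```python
import ipaddress
import re

def is_valid_ipv4(addr):
    """Return True if addr parses as a four-part dotted IPv4 address."""
    try:
        ipaddress.IPv4Address(addr)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True

# Same loose candidate pattern as above: matches some IPs and more.
candidate_re = re.compile(r"\b\d+(?:\.\d+){3}\b")

sample = "server 10.0.0.7 replied, 999.1.1.1 did not"
found = [ip for ip in candidate_re.findall(sample) if is_valid_ipv4(ip)]
print(found)
```

ipaddress.IPv4Address accepts only the four-part dotted form, whereas the Python docs note that inet_aton also accepts strings with fewer than three dots. The regex already requires four parts, so here the two checks differ mainly in style.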

jfs
  • @Sazid: It is a library that you can use to extract info from HTML text. I've added a link to its docs – jfs May 04 '13 at 12:52
0

You can use this. It will only accept VALID IP addresses:

import re
# Raw string avoids doubling the backslashes; groups are non-capturing,
# so findall returns whole addresses as strings
pattern = r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
text = "192.168.0.1 my other IP is 192.168.0.254 but this one isn't a real ip 555.555.555.555"
m = re.findall(pattern, text)
for i in m:
    print(i)

OUTPUT:

C:\wamp\www>Example.py
192.168.0.1
192.168.0.254

--Tested and working

rcbevans
  • Sure that works, but what if it is not a valid IP? e.g. 555.168.0.1? – eandersson May 03 '13 at 11:04
  • The question is, and I quote, "get a pattern of numbers like 126.23.73.34 from the document and extract it" It didn't say anything about actually validating the extracted values – rcbevans May 03 '13 at 11:08
  • That doesn't mean that other people won't look at this question a month, or year from now. It is always in the best interest of the community to provide the most complete answer possible. – eandersson May 03 '13 at 11:14