I have a plain-text document from which I'm trying to extract all of the IP addresses. I was able to extract them with regular expressions, but the pattern also grabs a large number of version numbers. I tried using string.find(), but that requires knowing the character that ends each line (the IP addresses are always the last thing on a line), and I don't know which line-ending character the file uses. Does anyone know how I could pull these addresses out?

Martijn Pieters

user1771694
- How about posting a piece of your document and the code you've written so far? – SpankMe May 24 '13 at 20:42
- "escape character used for the end of the line" -- do you mean the line separator, usually `\n` or `\r\n`? – Janne Karila May 24 '13 at 20:45
- Look at the `re` module and this link: http://answers.oreilly.com/topic/318-how-to-match-ipv4-addresses-with-regular-expressions/ – 0x90 May 24 '13 at 20:46
2 Answers
3
If your addresses are always at the end of a line, then anchor on that:
ip_at_end = re.compile(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', re.MULTILINE)
This regular expression matches only dotted quads (four groups of one to three digits separated by dots) at the end of a line. It does not check that each group is in the range 0-255.
Demo:
>>> import re
>>> ip_at_end = re.compile(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', re.MULTILINE)
>>> example = '''\
... Only addresses on the end of a line match: 123.241.0.15
... Anything else doesn't: 124.76.67.3, even other addresses.
... Anything that is less than a dotted quad also fails, so 1.1.4
... does not match but 1.2.3.4
... will.
... '''
>>> ip_at_end.findall(example)
['123.241.0.15', '1.2.3.4']
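
Since the question mentions not knowing which line ending the file uses, here is a rough sketch of one way to make the anchor more forgiving. It assumes that the only things that can follow an address are spaces, tabs, or a stray `\r` left over from Windows-style `\r\n` endings; the function name `find_trailing_ips` and the sample text are made up for illustration.

```python
import re

# Same dotted-quad pattern as above, but allow trailing spaces, tabs, or a
# stray '\r' (from '\r\n' line endings) between the address and end of line.
# The address itself is captured in group 1, so findall() returns only that.
ip_at_end = re.compile(
    r'((?:[0-9]{1,3}\.){3}[0-9]{1,3})[ \t\r]*$',
    re.MULTILINE,
)

def find_trailing_ips(text):
    """Return the dotted quads that close each line of text (hypothetical helper)."""
    return ip_at_end.findall(text)

# Illustrative input mixing '\r\n' and '\n' endings plus trailing spaces.
sample = "host a 10.0.0.1\r\nversion 1.2.3 build\r\nhost b 192.168.1.20  \n"
print(find_trailing_ips(sample))
```

With `re.MULTILINE`, `$` matches just before each `\n`, so absorbing an optional `\r` and trailing blanks ahead of it covers both common line-ending styles without having to detect which one the file uses.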

Martijn Pieters
- 0x90: I was assuming IPv4 because the OP was claiming version numbers were interfering; IPv6-formatted addresses, which use `:` as a delimiter, are rarely mistaken for software versions. – Martijn Pieters May 24 '13 at 20:51
2
Description
This expression matches IPv4 addresses and validates them, ensuring that each octet is in the range 0-255. The alternatives are ordered longest-first and the whole expression is wrapped in word boundaries (\b) so that the final octet is not cut short (with the shortest alternative first and no boundary, 12.34.56.78 is matched only as 12.34.56.7):
\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
Disclaimer
Yes, I realize the OP asked for a Python solution. This PHP example is included only to show how the expression works.
PHP example
<?php
// The first line contains a valid address; the second has an octet (567) out of range.
$sourcestring="this is a valid ip 12.34.56.78
this is not valid ip 12.34.567.89";
preg_match_all('/\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b/',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => 12.34.56.78
)
)
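
Since the question asks for Python, here is a rough Python sketch of the same octet-validating idea. The names `OCTET`, `IPV4`, and `find_ipv4` are illustrative and not part of the answer; ordering the alternatives longest-first and wrapping the pattern in `\b` are assumptions made here so that the last octet is matched in full.

```python
import re

# One octet in the range 0-255, longest alternatives first so that the
# regex engine does not settle for a shorter prefix such as "7" in "78".
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

# Four octets separated by dots, bounded by word boundaries.
IPV4 = re.compile(r'\b(?:' + OCTET + r'\.){3}' + OCTET + r'\b')

def find_ipv4(text):
    """Return every range-checked dotted quad found in text (hypothetical helper)."""
    return IPV4.findall(text)

# Same sample text as the PHP example.
source = "this is a valid ip 12.34.56.78\nthis is not valid ip 12.34.567.89"
print(find_ipv4(source))
```

Because the pattern contains only non-capturing groups, `findall()` returns the full matched addresses rather than individual octets.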

Ro Yo Mi
- How did you generate that awesome graph? Is there a website that does that for regex input? – SethMMorton May 24 '13 at 20:58
- @SethMMorton Yes, for this I'm using http://www.debuggex.com/. If you use it, keep in mind that it supports JavaScript-style expressions and doesn't understand lookbehinds. – Ro Yo Mi May 24 '13 at 21:04
- @denomales Not sure if you've seen it since you posted this answer, but debuggex can now generate the image for you so you don't have to go through the trouble of copy/pasting/cropping :) – Sergiu Toarca May 31 '13 at 02:01
- Excellent! You had told me that feature was on the way. It looks really good, thank you :) – Ro Yo Mi May 31 '13 at 02:54