1

How can I use a regex to select the first 2 columns of each row in a file with 3, 4 or X columns separated by whitespace (not a constant width, but a variable number of spaces on each line)?

My files consist of: IP [SPACES] Subnet_Mask [SPACES] NEXT_HOP_IP [NEW LINE]

All rows use that format. How can I extract only the first 2 columns? (IP & Subnet mask)

Here is an example on which to try your regex:

10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0   47.73.40.0   47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96         172.17.103.100       172.17.103.136
172.17.103.140 172.17.104.44            172.17.105.28
172.17.105.32       172.17.105.220      172.17.105.224

Don't pay attention to the specific IPs. I know the second column doesn't contain valid subnet masks. It's just an example.

I already tried:

(?P<IP_ADD>\s*[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(?P<space>\s*)(?P<MASK>[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\s+|\D*))

But it doesn't quite work...

mdeous
Con7e
  • Why do you need a regex here? Use `csv` module or just split each line by space. – alecxe Apr 24 '14 at 13:54
  • I need some sort of "one liner". I don't want to open the file, close it, etc. Need something "quick and dirty". – Con7e Apr 24 '14 at 13:57
  • So to be sure, you want to parse the file without opening it? – Robin Apr 24 '14 at 14:00
  • I need it for my job. I don't have time to always pass in a file or save all the text in a file. I'd like to just put some random string and get the result I want – Con7e Apr 24 '14 at 14:02
  • So what's not working for you in the one line solution from the duplicate link? Multiline? – Robin Apr 24 '14 at 14:13

4 Answers

1

One liner it is:

[s.split()[:2] for s in string.split('\n')]

Example

string = """10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0   47.73.40.0   47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96         172.17.103.100       172.17.103.136
172.17.103.140 172.17.104.44            172.17.105.28
172.17.105.32       172.17.105.220      172.17.105.224"""

print([s.split()[:2] for s in string.split('\n')])

Outputs

[['10.97.96.0', '10.97.97.128'],
 ['47.73.4.128', '47.73.7.6'],
 ['47.73.15.0', '47.73.40.0'],
 ['85.205.9.164', '85.205.14.44'],
 ['172.17.103.8', '172.17.103.48'],
 ['172.17.103.96', '172.17.103.100'],
 ['172.17.103.140', '172.17.104.44'],
 ['172.17.105.32', '172.17.105.220']]
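
If the pasted text might contain blank lines or Windows line endings, a slightly more defensive variant could look like the sketch below. The function name `first_two_columns` and the `sample` string are made up for illustration; the idea is the same as the one-liner, only with `splitlines()` and a blank-line filter.

```python
def first_two_columns(text):
    """Return the first two whitespace-separated fields of each non-blank line."""
    # splitlines() handles both "\n" and "\r\n"; split() with no argument
    # treats any run of spaces or tabs as a single separator.
    return [line.split()[:2] for line in text.splitlines() if line.strip()]


# Hypothetical sample: two data rows followed by a blank line.
sample = "10.97.96.0 10.97.97.128 47.73.1.0\n47.73.15.0   47.73.40.0   47.73.41.0\n\n"
print(first_two_columns(sample))
# [['10.97.96.0', '10.97.97.128'], ['47.73.15.0', '47.73.40.0']]
```

A row with fewer than two fields would still come back, just as a shorter list, since slicing never raises here.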
Robin
  • Why use a regex to split on newlines? Using `string.split('\n')` would have much less performance overhead. – mdeous Apr 24 '14 at 14:29
  • @MatToufoutu: You're absolutely right. I was using raw string and didn't understand why `string.split(r'\n')` wasn't working... Thanks ! – Robin Apr 24 '14 at 14:33
1

With a regular expression:

If you want to get the first 2 columns, whatever they contain and however much whitespace separates them, you can use \S (matches any non-whitespace character) and \s (matches whitespace only) to achieve that:

import re
lines = """
    10.97.96.0 10.97.97.128 47.73.1.0
    47.73.4.128 47.73.7.6 47.73.8.0
    47.73.15.0   47.73.40.0   47.73.41.0
    85.205.9.164 85.205.14.44 172.17.103.0
    172.17.103.8 172.17.103.48 172.17.103.56
    172.17.103.96         172.17.103.100       172.17.103.136
    172.17.103.140 172.17.104.44            172.17.105.28
    172.17.105.32       172.17.105.220      172.17.105.224
"""
# Anchor each match to the start of a line (re.MULTILINE); without the anchor,
# findall pairs up consecutive values across line boundaries.
# Assumes every line has at least 2 columns.
regex = re.compile(r'^\s*(\S+)\s+(\S+)', re.MULTILINE)
regex.findall(lines)

Result:

[('10.97.96.0', '10.97.97.128'),
 ('47.73.4.128', '47.73.7.6'),
 ('47.73.15.0', '47.73.40.0'),
 ('85.205.9.164', '85.205.14.44'),
 ('172.17.103.8', '172.17.103.48'),
 ('172.17.103.96', '172.17.103.100'),
 ('172.17.103.140', '172.17.104.44'),
 ('172.17.105.32', '172.17.105.220')]

Without a regular expression

If you don't want to use a regex but still need to handle multiple spaces, you could also do:

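while '  ' in lines:  # notice the two-spaces-string
    lines = lines.replace('  ', ' ')
# strip() drops the single leading space left over on indented lines,
# which would otherwise produce an empty first column
columns = [line.strip().split(' ')[:2] for line in lines.split('\n') if line.strip()]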

Pros and cons:

The advantage of using a regex is that it also parses the data properly if the separators include tabs, which the 2nd solution does not. On the other hand, regular expressions need more computation than simple string splitting, which could make a difference on very large data sets.
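
To make the tab point concrete, here is a small sketch comparing the two approaches on made-up input where one row is tab-separated. The `data` string and variable names are only for illustration, and the pattern is anchored to line starts with `re.MULTILINE` so that each match stays on one row.

```python
import re

# Hypothetical input: first row uses tabs, second row uses runs of spaces.
data = "10.97.96.0\t10.97.97.128\t47.73.1.0\n47.73.4.128   47.73.7.6   47.73.8.0\n"

# Regex approach: \s matches tabs as well as spaces.
pattern = re.compile(r'^\s*(\S+)\s+(\S+)', re.MULTILINE)
with_regex = pattern.findall(data)

# Space-only approach: collapse double spaces, then split on a single space.
collapsed = data
while '  ' in collapsed:
    collapsed = collapsed.replace('  ', ' ')
with_spaces = [line.split(' ')[:2] for line in collapsed.split('\n') if line]

print(with_regex)   # both rows split into two columns
print(with_spaces)  # the tab-separated row stays in one piece
```

The space-only version returns the whole tab-separated row as a single element, while the regex version yields two columns for both rows.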

mdeous
  • This is AWESOME. Too bad I can only select 1 best answer... thank you for your help guys! – Con7e Apr 24 '14 at 14:39
  • Just FYI, `split()` already takes care of the multiple spaces (see [doc](https://docs.python.org/2/library/stdtypes.html#str.split)) so no need for the `while` loop, `[\S]` is the same as `\S` and it's usually a good habit to use raw strings for regexes (and not for normal strings, as I remembered the hard way :) ) even if not *technically* mandatory here – Robin Apr 24 '14 at 14:49
  • yes, but split without args would also split on newlines, which isn't a desired behavior (there would be no distinction between lines). you're right regarding regexes, I wrote it first with `[^\s]`, and forgot the brackets. edited – mdeous Apr 24 '14 at 14:52
  • Yeah, however since you already split on `\n` you basically work line by line. Anyway, that's a nice answer. – Robin Apr 24 '14 at 15:09
0

Edited to perform space match with any number of spaces.

As an option, you can accomplish this with Python regular expressions like this, if you know you want the first 2 space-separated values.

A good regex cheat sheet will also help you find some shortcuts. Token classes such as word characters, whitespace, and digits each have a shorthand.

import re
line = "10.97.96.0 10.97.97.128 47.73.1.0"
result = re.split(r"\s+", line)[0:2]  # raw string, so \s reaches the regex engine unchanged

print(result)
['10.97.96.0', '10.97.97.128']
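
One thing to watch with `re.split`: if a line starts with whitespace, the first element of the result is an empty string. The sketch below shows the difference, using a made-up indented line; stripping the line first, or using plain `str.split()`, avoids it.

```python
import re

# Hypothetical line with leading spaces and a tab between columns 2 and 3.
line = "   10.97.96.0   10.97.97.128\t47.73.1.0"

raw = re.split(r"\s+", line)[0:2]               # leading whitespace yields '' first
stripped = re.split(r"\s+", line.strip())[0:2]  # strip before splitting
plain = line.split()[0:2]                       # str.split() ignores leading whitespace

print(raw)       # ['', '10.97.96.0']
print(stripped)  # ['10.97.96.0', '10.97.97.128']
print(plain)     # ['10.97.96.0', '10.97.97.128']
```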
NanoBennett
0

Since you need "some sort of one-liner", there are many ways that do not involve Python. Maybe:

| awk '{print $1,$2}'

with anything that produces your input on stdout.
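
For comparison, a rough Python counterpart of that filter might look like the sketch below. The function name `print_first_two` is made up, and unlike the awk command it skips rows with fewer than two fields instead of printing them; `io.StringIO` stands in for piped input so the example runs on its own.

```python
import io
import sys


def print_first_two(stream, out=sys.stdout):
    """Write the first two whitespace-separated fields of each line to out."""
    for line in stream:
        fields = line.split()
        if len(fields) >= 2:  # assumption: rows with fewer fields are skipped
            print(fields[0], fields[1], file=out)


# Demo with an in-memory stream; in a pipeline you would pass sys.stdin instead.
demo = io.StringIO("10.97.96.0 10.97.97.128 47.73.1.0\n47.73.15.0   47.73.40.0   47.73.41.0\n")
buf = io.StringIO()
print_first_two(demo, out=buf)
print(buf.getvalue(), end="")
```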

Trygve Flathen