Suggestion for python regex and selecting columns

Question

Given a file with 3, 4 or X columns separated by whitespace (a varying number of spaces on each line), how can I select the first 2 columns of each row with a regex?

My files consist of : IP [SPACES] Subnet_Mask [SPACES] NEXT_HOP_IP [NEW LINE]

All rows use that format. How can I extract only the first 2 columns? (IP & Subnet mask)

Here is an example on which to try your regex:

10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0   47.73.40.0   47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96         172.17.103.100       172.17.103.136
172.17.103.140 172.17.104.44            172.17.105.28
172.17.105.32       172.17.105.220      172.17.105.224

Don't pay attention to the specific IPs. I know the second column does not contain valid subnet masks. It's just an example.

I already tried:

(?P<IP_ADD>\s*[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(?P<space>\s*)(?P<MASK>[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\s+|\D*))

But it doesn't quite work...

Why do you need a regex here? Use `csv` module or just split each line by space. — alecxe, Apr 24 '14 at 13:54
I need some sort of "one liner". I don't want to open the file, close it, etc. Need something "quick and dirty". — Con7e, Apr 24 '14 at 13:57
So to be sure, you want to parse the file without opening it? — Robin, Apr 24 '14 at 14:00
I need it for my job. I don't have time to always pass in a file or save all the text in a file. I'd like to just put some random string and get the result I want — Con7e, Apr 24 '14 at 14:02
So what's not working for you in the one line solution from the duplicate link? Multiline? — Robin, Apr 24 '14 at 14:13

Robin · Answer 1 · 2014-04-24T14:31:32.190

1

One liner it is:

[s.split()[:2] for s in string.split('\n')]

Example

string = """10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0   47.73.40.0   47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96         172.17.103.100       172.17.103.136
172.17.103.140 172.17.104.44            172.17.105.28
172.17.105.32       172.17.105.220      172.17.105.224"""

print([s.split()[:2] for s in string.split('\n')])

Outputs

[['10.97.96.0', '10.97.97.128'],
 ['47.73.4.128', '47.73.7.6'],
 ['47.73.15.0', '47.73.40.0'],
 ['85.205.9.164', '85.205.14.44'],
 ['172.17.103.8', '172.17.103.48'],
 ['172.17.103.96', '172.17.103.100'],
 ['172.17.103.140', '172.17.104.44'],
 ['172.17.105.32', '172.17.105.220']]
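
A minimal sketch of the same idea wrapped in a function, assuming Python 3. The name first_two_columns is hypothetical, not from the answer. It uses splitlines() instead of split('\n') and skips blank lines, so a pasted string with empty lines or Windows line endings does not produce empty entries:

```python
def first_two_columns(text):
    """Return the first two whitespace-separated fields of each non-blank line."""
    # splitlines() handles both "\n" and "\r\n"; split() with no argument
    # collapses any run of spaces or tabs between columns.
    return [line.split()[:2] for line in text.splitlines() if line.strip()]


sample = "10.97.96.0 10.97.97.128 47.73.1.0\r\n\r\n47.73.15.0   47.73.40.0   47.73.41.0\r\n"
print(first_two_columns(sample))
# [['10.97.96.0', '10.97.97.128'], ['47.73.15.0', '47.73.40.0']]
```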

edited Apr 24 '14 at 14:31

answered Apr 24 '14 at 14:25

Robin


Why use a regex to split on newlines? Using `string.split('\n')` would have much less performance overhead. – mdeous Apr 24 '14 at 14:29
@MatToufoutu: You're absolutely right. I was using raw string and didn't understand why `string.split(r'\n')` wasn't working... Thanks ! – Robin Apr 24 '14 at 14:33

mdeous · Accepted Answer · 2014-04-24T14:50:56.083

With a regular expression:

If you want the first two columns of each row, whatever they contain and however much whitespace separates them, you can use \S (matches any non-whitespace character) and \s (matches any whitespace character). Anchor the pattern to the start of each line with ^ and re.MULTILINE; otherwise findall pairs up tokens across the whole text, ignoring line boundaries, and returns the third column of one row together with the first column of the next:

import re
lines = """
    10.97.96.0 10.97.97.128 47.73.1.0
    47.73.4.128 47.73.7.6 47.73.8.0
    47.73.15.0   47.73.40.0   47.73.41.0
    85.205.9.164 85.205.14.44 172.17.103.0
    172.17.103.8 172.17.103.48 172.17.103.56
    172.17.103.96         172.17.103.100       172.17.103.136
    172.17.103.140 172.17.104.44            172.17.105.28
    172.17.105.32       172.17.105.220      172.17.105.224
"""
# ^ with re.MULTILINE matches at the start of every line, so each match
# begins with a row's first column (assumes every row has at least two columns)
regex = re.compile(r'^\s*(\S+)\s+(\S+)', re.MULTILINE)
regex.findall(lines)

Result:

[('10.97.96.0', '10.97.97.128'),
 ('47.73.4.128', '47.73.7.6'),
 ('47.73.15.0', '47.73.40.0'),
 ('85.205.9.164', '85.205.14.44'),
 ('172.17.103.8', '172.17.103.48'),
 ('172.17.103.96', '172.17.103.100'),
 ('172.17.103.140', '172.17.104.44'),
 ('172.17.105.32', '172.17.105.220')]
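
A small sketch, assuming Python 3, of how the pattern behaves on multi-line input. Without a line anchor, findall consumes tokens two at a time across line boundaries; with ^ and re.MULTILINE each match starts at the beginning of a row. The two-row sample is taken from the question's data; the variable names are only illustrative:

```python
import re

text = (
    "10.97.96.0 10.97.97.128 47.73.1.0\n"
    "47.73.4.128   47.73.7.6 47.73.8.0\n"
)

# Unanchored: \s+ also matches the newline, so tokens are paired across rows.
unanchored = re.findall(r'(\S+)\s+(\S+)', text)

# Anchored: ^ matches at the start of each line when re.MULTILINE is set.
anchored = re.findall(r'^\s*(\S+)\s+(\S+)', text, re.MULTILINE)

print(unanchored)
# [('10.97.96.0', '10.97.97.128'), ('47.73.1.0', '47.73.4.128'), ('47.73.7.6', '47.73.8.0')]
print(anchored)
# [('10.97.96.0', '10.97.97.128'), ('47.73.4.128', '47.73.7.6')]
```

The anchored form still assumes every row has at least two columns; a row with a single token would be paired with the first token of the next row.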

Without a regular expression

If you don't want to use a regex but still need to handle multiple spaces, you could also do:

while '  ' in lines:  # note the two-space string
    lines = lines.replace('  ', ' ')
# strip() drops the leading space left over from indented lines
columns = [line.strip().split(' ')[:2] for line in lines.split('\n') if line.strip()]

Pros and cons:

The advantage of the regex is that it also parses the data correctly when the separators include tabs, which the second solution does not handle. On the other hand, a regular expression costs more computation than simple string splitting, which can matter on very large data sets.
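
The tab case can be checked with a short sketch, assuming Python 3 and a made-up sample that mixes tabs and spaces. Splitting on a literal space leaves the tab inside a field, while a per-line str.split() with no argument treats any run of whitespace as one separator:

```python
lines = "10.97.96.0\t10.97.97.128  47.73.1.0\n47.73.4.128 \t 47.73.7.6\t47.73.8.0"

# Splitting on a single space: the tab stays glued between the first two fields.
by_space = [line.split(' ')[:2] for line in lines.splitlines()]

# split() with no argument splits on any whitespace run, tabs included.
by_whitespace = [line.split()[:2] for line in lines.splitlines() if line.strip()]

print(by_space[0])
# ['10.97.96.0\t10.97.97.128', '']
print(by_whitespace)
# [['10.97.96.0', '10.97.97.128'], ['47.73.4.128', '47.73.7.6']]
```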

This is AWESOME. Too bad I can only select 1 best answer... thank you for your help guys! — Con7e, Apr 24 '14 at 14:39
Just FYI, `split()` already takes care of the multiple spaces (see [doc](https://docs.python.org/2/library/stdtypes.html#str.split)) so no need for the `while` loop, `[\S]` is the same as `\S` and it's usually a good habit to use raw strings for regexes (and not for normal strings, as I remembered the hard way :) ) even if not *technically* mandatory here — Robin, Apr 24 '14 at 14:49
yes, but split without args would also split on newlines, which isn't a desired behavior (there would be no distinction between lines). you're right regarding regexes, I wrote it first with `[^\s]`, and forgot the brackets. edited — mdeous, Apr 24 '14 at 14:52
Yeah, however since you already split on `\n` you basically work line by line. Anyway, that's a nice answer. — Robin, Apr 24 '14 at 15:09

NanoBennett · Answer 3 · 2014-04-24T14:33:18.457

0

Edited to perform space match with any number of spaces.

If you know you want the first two whitespace-separated values, you can also do this with Python's re.split.

A regex cheat sheet helps here: common character classes such as word characters, whitespace, and digits have shorthand escapes (\w, \s, \d).

import re
line = "10.97.96.0 10.97.97.128 47.73.1.0"
result = re.split(r"\s+", line)[0:2]

result
['10.97.96.0', '10.97.97.128']
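
One caveat, sketched here under the assumption that a line may start with whitespace (the sample row is from the question, with leading spaces added for illustration): re.split returns an empty string as the first element when the separator matches at the start of the string. Stripping the line first avoids that, and maxsplit stops splitting once the first two fields are separated:

```python
import re

line = "   172.17.103.96         172.17.103.100       172.17.103.136"

# Leading whitespace matches the separator, so the first element is ''.
naive = re.split(r"\s+", line)[:2]

# strip() removes the leading whitespace; maxsplit=2 leaves the rest unsplit.
cleaned = re.split(r"\s+", line.strip(), maxsplit=2)[:2]

print(naive)    # ['', '172.17.103.96']
print(cleaned)  # ['172.17.103.96', '172.17.103.100']
```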

edited Apr 24 '14 at 14:33

answered Apr 24 '14 at 14:04

NanoBennett


How about multiple spaces? My string is not always separated by only 1 space (see example) – Con7e Apr 24 '14 at 14:07
Replace the \s with a \s* or \s+ to match multiple spaces – NanoBennett Apr 24 '14 at 14:16

score 0 · Answer 4 · answered Apr 24 '14 at 14:13

0

Since you need "some sort of one-liner", there are many ways that do not involve Python. Maybe:

| awk '{print $1,$2}'

with anything that produces your input on stdout.
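
Where awk is not available, a rough Python equivalent of the awk filter could look like the sketch below (Python 3 assumed; the function name and the in-memory sample are hypothetical). It takes any iterable of lines, so the same function works on a pasted string, an open file, or sys.stdin:

```python
import io


def first_two(lines):
    """Yield the first two whitespace-separated fields of each line, like awk's $1,$2."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or single-column lines
            yield fields[0], fields[1]


# Demo on an in-memory stream; pass sys.stdin instead to use it in a pipeline.
stream = io.StringIO("47.73.15.0   47.73.40.0   47.73.41.0\n\n85.205.9.164 85.205.14.44 172.17.103.0\n")
for ip, mask in first_two(stream):
    print(ip, mask)
# 47.73.15.0 47.73.40.0
# 85.205.9.164 85.205.14.44
```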

answered Apr 24 '14 at 14:13

Trygve Flathen


Problem is I am on Windows. – Con7e Apr 24 '14 at 14:16
