
From this example:

>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

is there a straightforward way to associate the capture group with either the right or left portion of the split? E.g. using the same regex/capture group, but yielding:

['foo', '/bar', ' spam', '\neggs']

or optionally

['foo/', 'bar ', 'spam\n', 'eggs']

I'm sure you can achieve it by changing the actual regex, but that's not the point (and we could modify the example to make the matches more complicated, so that it's a real pain to be unable to just re-use them and push them to the right or left).

Unfortunately it looks like making it a non-capturing group just drops the corresponding characters from the match:

>>> re.split(r'(?:\W)', 'foo/bar spam\neggs')
['foo', 'bar', 'spam', 'eggs']

By way of another example, consider if you had some text from a misbehaved CSV file. Each line only has one actual comma to split by, but accidentally some lines also have a comma in one of the fields. Luckily, the non-splitting commas are always followed by a space.

csv_data = [
    'Some good data,Id 5',
    'Some bad data, like, really bad, dude,Id 6'
]

The goal in this case is to process this into:

[['Some good data', 'Id 5'],
 ['Some bad data, like, really bad, dude', 'Id 6']]

through the use of a simple re.split.

Using map(lambda x: re.split(r",(?:\S)", x), csv_data) produces

[['Some good data', 'd 5'], 
 ['Some bad data, like, really bad, dude', 'd 6']]

and using map(lambda x: re.split(r",(\S)", x), csv_data) produces

[['Some good data', 'I', 'd 5'],
 ['Some bad data, like, really bad, dude', 'I', 'd 6']]

So what is a generic approach to re.split that would work the same for both of these cases? Basically something I could wrap in a function, like

def my_split(regex_chars, my_strs):
    return map(lambda x: re.split(...regex_chars..., x), my_strs)

such that both

my_split(r'(\W)', ['foo/bar spam\neggs']) 

and

my_split(r',(\S)', csv_data) 

each returns the expected output as from above.

Note: It appears this is not possible in just re, but could be possible with some mixture of regex and re based on whether the split is zero-width or not.

ely

2 Answers


No, it is not possible. I'm not aware of any regex engine that supports this sort of thing. Splitting means splitting: you can keep the splitter or you can discard it, but you can't lump it with the pieces between the splits, because the separator is distinct from the things it separates.
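
That said, the list that re.split returns for a pattern with a capture group can be re-glued afterwards. The following is only a sketch, assuming the pattern contains exactly one capture group that always participates in the match (so pieces and captured chunks strictly alternate). The helper names glue_right and glue_left are made up for illustration, and my_split follows the signature from the question:

```python
import re

def glue_right(pattern, text):
    # With one capture group, re.split returns
    # [piece, captured, piece, captured, ..., piece].
    parts = re.split(pattern, text)
    # Prepend each captured chunk to the piece that follows it.
    return [parts[0]] + [cap + piece for cap, piece in zip(parts[1::2], parts[2::2])]

def glue_left(pattern, text):
    parts = re.split(pattern, text)
    # Append each captured chunk to the piece that precedes it.
    return [piece + cap for piece, cap in zip(parts[0::2], parts[1::2])] + [parts[-1]]

def my_split(pattern, strings):
    return [glue_right(pattern, s) for s in strings]

csv_data = [
    'Some good data,Id 5',
    'Some bad data, like, really bad, dude,Id 6',
]

print(my_split(r'(\W)', ['foo/bar spam\neggs']))
# [['foo', '/bar', ' spam', '\neggs']]
print(my_split(r',(\S)', csv_data))
# [['Some good data', 'Id 5'], ['Some bad data, like, really bad, dude', 'Id 6']]
print(glue_left(r'(\W)', 'foo/bar spam\neggs'))
# ['foo/', 'bar ', 'spam\n', 'eggs']
```

Anything the pattern matches outside the group (the comma in the second call) is still discarded by re.split; only the captured text is pushed onto the neighbouring piece.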

With the regex module you can do it fairly simply, but it does require changing the original regex:

>>> regex.split(r'(?=\W)', 'foo/bar spam\neggs', flags=regex.V1)
['foo', '/bar', ' spam', '\neggs']

Unlike the builtin re module (before Python 3.7), the regex module allows splitting on zero-width matches, so you can use a lookahead to split at positions where the next character matches \W.

In the example you added in your edit, you can do it with lookahead even with plain re, because the splitter is not zero-width:

>>> list(map(lambda x: re.split(r",(?=\S)", x), csv_data))
[['Some good data', 'Id 5'],
 ['Some bad data, like, really bad, dude', 'Id 6']]
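
For reference, on Python 3.7 and later the builtin re module also splits on empty matches, so the same lookaround idea works without the regex module. A small sketch, assuming Python 3.7+ (on older versions re.split would not split on these zero-width patterns):

```python
import re

text = 'foo/bar spam\neggs'

# Lookahead: split just before each non-word character (separator goes right).
right = re.split(r'(?=\W)', text)
print(right)  # ['foo', '/bar', ' spam', '\neggs']

# Lookbehind: split just after each non-word character (separator goes left).
left = re.split(r'(?<=\W)', text)
print(left)   # ['foo/', 'bar ', 'spam\n', 'eggs']
```
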
BrenBarn
  • could you explain what `regex.V1` flag does? – Avinash Raj Feb 09 '15 at 02:56
  • @AvinashRaj: It is described on the documentation page I linked to. – BrenBarn Feb 09 '15 at 02:57
  • I would characterize it a bit differently (see my extended example with commas added to the bottom of the question). It's not necessarily a zero-width split. In the comma example, you want to split by a single character (a comma) but only those specific commas that have a particular property (they are followed immediately by a non-space character). Whatever that non-space character is does not matter and is not part of the split (only part of recognizing the comma) and that non-space character needs to be put somewhere (in my case to the right of the split). Perhaps *this* is achievable? – ely Feb 09 '15 at 02:58
  • @Mr.F: If that is what you wanted, you should have said that in your question. I've edited my answer to show how you can do that. – BrenBarn Feb 09 '15 at 03:03
  • I thought I did say it in my question and did not realize the additional example was needed for that to be clear. – ely Feb 09 '15 at 03:05
  • Python `re` already supports splitting on empty matches beginning with 3.7. – Wiktor Stribiżew Feb 11 '22 at 10:36

If that is the case, you could use a regex based on negative lookahead, like the one below.

>>> csv_data = [
    'Some good data,Id 5',
    'Some bad data, like, really bad, dude,Id 6'
]
>>> [re.split(r',(?!\s)', i) for i in csv_data]
[['Some good data', 'Id 5'], ['Some bad data, like, really bad, dude', 'Id 6']]

,(?!\s) matches every comma that is not followed by a whitespace character. Splitting on those commas gives you the desired output.
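
Note that the negative lookahead ,(?!\s) and the positive lookahead ,(?=\S) are not quite interchangeable: they differ when a comma is the last character of the string, because "not followed by whitespace" is true at the end of the string while "followed by a non-whitespace character" is not. A small sketch with a made-up input line:

```python
import re

line = 'a,b,'  # hypothetical input that ends in a comma

positive = re.split(r',(?=\S)', line)
negative = re.split(r',(?!\s)', line)

print(positive)  # ['a', 'b,']     trailing comma is not a split point
print(negative)  # ['a', 'b', '']  trailing comma is a split point
```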

Avinash Raj
  • I guess this works for my case, but you can also easily imagine cases where it's not so easy to add a quick modifier to the regex (like in this case the easy use of `!`). It still doesn't get at the idea of merely *using* a capture group to identify a property that the rest of the regex has, but then *not* matching or dropping the capture group itself. – ely Feb 09 '15 at 03:02
  • @Mr.F: "Identifying a property that the rest of the regex has, but then not matching that part" is what lookarounds are for. If what you want to split on is "the *next* character is X", then you don't want to split on X, and you should design your splitting regex to directly encode the fact that you want to split on positions where the next character is X. – BrenBarn Feb 09 '15 at 03:06
  • `!` isn't a modifier. `(?!..)` is called a negative lookahead. – Avinash Raj Feb 09 '15 at 03:08
  • @BrenBarn I was looking for a general solution to the "the next character is X" property ... a solution that worked whether the regex-before-that-next-X-character was zero width or not. I'm looking for something that would work the same in my first example and second example, only needing to swap in the different boundary regex but then getting the same type of behavior in both cases. "It's not possible in `re`" is a valid answer, but I don't get the pushback about the question. – ely Feb 09 '15 at 03:08