From this example:
>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
Is there a straightforward way to attach the captured delimiter to either the right or the left portion of the split? E.g., using the same regex/capture group, but yielding:
['foo', '/bar', ' spam', '\neggs']
or optionally
['foo/', 'bar ', 'spam\n', 'eggs']
I'm sure you can achieve it by changing the actual regex, but that's not the point (and we could modify the example to make the matches more complicated, so that it's a real pain to be unable to just re-use them and push them to the right or left).
Unfortunately, it looks like making it a non-capturing group just drops the corresponding characters from the result:
>>> re.split(r'(?:\W)', 'foo/bar spam\neggs')
['foo', 'bar', 'spam', 'eggs']
By way of another example, consider if you had some text from a misbehaved CSV file. Each line only has one actual comma to split by, but accidentally some lines also have a comma in one of the fields. Luckily, the non-splitting commas are always followed by a space.
csv_data = [
'Some good data,Id 5',
'Some bad data, like, really bad, dude,Id 6'
]
The goal in this case is to process this into:
[['Some good data', 'Id 5'],
['Some bad data, like, really bad, dude', 'Id 6']]
through the use of a simple re.split.
Using map(lambda x: re.split(r",(?:\S)", x), csv_data)
produces
[['Some good data', 'd 5'],
['Some bad data, like, really bad, dude', 'd 6']]
and using map(lambda x: re.split(r",(\S)", x), csv_data)
produces
[['Some good data', 'I', 'd 5'],
['Some bad data, like, really bad, dude', 'I', 'd 6']]
So what is a generic approach to re.split that would work the same for both of these cases? Basically something I could wrap in a function, like
def my_split(regex_chars, my_strs):
    return map(lambda x: re.split(...regex_chars..., x), my_strs)
such that both
my_split(r'(\W)', ['foo/bar spam\neggs'])
and
my_split(r',(\S)', csv_data)
each return the expected output shown above.
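For illustration, here is a minimal sketch of one way such a wrapper might look. It does not do the job in a single re.split call; instead it post-processes the list that re.split returns. It assumes the pattern contains exactly one capturing group that always participates in the match, so the result alternates text, capture, text, and so on. The `attach` parameter is a hypothetical addition, not part of the signature above.

```python
import re

def my_split(regex_chars, my_strs, attach="right"):
    """Hypothetical helper: split on regex_chars and glue each captured
    piece onto the neighbouring chunk.

    Assumes regex_chars has exactly one capturing group, so re.split
    returns [text, capture, text, capture, ..., text].
    """
    results = []
    for s in my_strs:
        parts = re.split(regex_chars, s)
        texts, caps = parts[0::2], parts[1::2]
        if attach == "right":
            # Prepend each capture to the chunk that follows it.
            merged = [texts[0]] + [c + t for c, t in zip(caps, texts[1:])]
        else:
            # Append each capture to the chunk that precedes it.
            merged = [t + c for t, c in zip(texts, caps)] + [texts[-1]]
        results.append(merged)
    return results

csv_data = [
    'Some good data,Id 5',
    'Some bad data, like, really bad, dude,Id 6',
]

print(my_split(r'(\W)', ['foo/bar spam\neggs']))
# [['foo', '/bar', ' spam', '\neggs']]
print(my_split(r'(\W)', ['foo/bar spam\neggs'], attach="left"))
# [['foo/', 'bar ', 'spam\n', 'eggs']]
print(my_split(r',(\S)', csv_data))
# [['Some good data', 'Id 5'], ['Some bad data, like, really bad, dude', 'Id 6']]
```

Because the regrouping happens after the split, the same pattern strings are reused unchanged. Patterns with zero or several capturing groups would need different handling.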
Note: It appears this is not possible in just re, but could be possible with some mixture of regex and re based on whether the split is zero-width or not.
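To make the zero-width idea concrete, here is a small sketch that does change the regex: the capturing group is rewritten by hand as a lookahead or lookbehind. It assumes Python 3.7 or later, where the standard re.split accepts patterns that can match an empty string; older versions behave differently.

```python
import re

text = 'foo/bar spam\neggs'

# Lookahead: split just before each non-word character,
# so the delimiter stays with the chunk on its right.
right = re.split(r'(?=\W)', text)
print(right)  # ['foo', '/bar', ' spam', '\neggs']

# Lookbehind: split just after each non-word character,
# so the delimiter stays with the chunk on its left.
left = re.split(r'(?<=\W)', text)
print(left)   # ['foo/', 'bar ', 'spam\n', 'eggs']

# CSV case: consume the comma, but only peek at the following character.
csv_data = [
    'Some good data,Id 5',
    'Some bad data, like, really bad, dude,Id 6',
]
rows = [re.split(r',(?=\S)', line) for line in csv_data]
print(rows)
# [['Some good data', 'Id 5'], ['Some bad data, like, really bad, dude', 'Id 6']]
```

The trade-off is that each pattern has to be rewritten manually, which is exactly what the question hopes to avoid for more complicated matches.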