Python regex to pick all elements that don't match pattern

Question

I asked a similar question yesterday ("Keep elements with pattern in pandas series without converting them to list"), and now I am faced with the opposite problem.

I have a pandas dataframe:

import pandas as pd
df = pd.DataFrame(["Air type:1, Space kind:2, water, wood", "berries, something at the start:4, Space blu:3, somethingelse"], columns = ['A'])

and I want to pick all elements that don't have a ":" in them. What I tried is the following regex which seems to be working:

df['new'] = df.A.str.findall('(^|\s)([^:,]+)(,|$)')
    A                                                               new
0   Air type:1, Space kind:2, water, wood                           [( , water, ,), ( , wood, )]
1   berries, something at the start:4, Space blu:3, somethingelse   [(, berries, ,), ( , somethingelse, )]

If I understand this correctly, findall searched for the 3 groups (the ones I have in parentheses) and returned each match as a tuple, with all the tuples wrapped in a list. Is there a way to avoid this and simply return only the middle group? That is, for the first row: water, wood; for the second row: berries, somethingelse.

I also tried the opposite approach:

df.A.str.replace('[^\s,][^:,]+:[^:,]+', '').str.replace('\s*,', '')

which seems to be working close to what I want (only the commas between the patterns are missing) but I am wondering if I am missing something that would make my life easier.

Try this: `df.A.str.findall(r'(?:^|,)([^:,]+)(?=,|$)')` – anubhava Nov 24 '20 at 06:55

score 3 · Accepted Answer · answered Nov 24 '20 at 06:56

You may use this regex code:

>>> df['new'] = df.A.str.findall(r'(?:^|,)([^:,]+)(?=,|$)')
>>> print (df)
                                                   A                        new
0              Air type:1, Space kind:2, water, wood            [ water,  wood]
1  berries, something at the start:4, Space blu:3...  [berries,  somethingelse]

Regex used is:

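(?:^|,): Match the start of the string or a comma (non-capturing group)
([^:,]+): Match 1+ of any character that is not a : and not a , (this is the only capturing group, so it is what findall returns)
(?=,|$): Lookahead to assert that we have either a , or end of line ahead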
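The breakdown above can be checked outside pandas with the standard `re` module, since `Series.str.findall` applies `re.findall` to each element. This is only a minimal sketch: the helper name `items_without_colon` is made up for illustration, and the `strip()` call is an addition that removes the space left after each comma (visible in the output above as `[ water,  wood]`).

```python
import re

# Pattern from the answer: a single capturing group, so findall returns strings.
PATTERN = re.compile(r'(?:^|,)([^:,]+)(?=,|$)')

rows = [
    "Air type:1, Space kind:2, water, wood",
    "berries, something at the start:4, Space blu:3, somethingelse",
]

def items_without_colon(text):
    """Hypothetical helper: return the comma-separated items that contain no ':'."""
    # Each captured item still carries the space that followed the comma.
    return [item.strip() for item in PATTERN.findall(text)]

results = [items_without_colon(row) for row in rows]
print(results)
```

In pandas the same stripping could presumably be done by applying a list comprehension to the result of `df.A.str.findall(...)`.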

One may also use `df.A.str.findall(r'(?:^|,)([^:,]+)(?=,|$)').apply(','.join)` to get a comma-delimited string of the matched substrings – anubhava Nov 24 '20 at 07:04

score 2 · Answer 2 · answered Nov 24 '20 at 06:53

You can use the following regex, which uses non-capturing groups (?:...):

df.A.str.findall(r'(?:^|\s)([^:,]{2,})(?:,|$)')

This returns the following output:

0               [water, wood]
1    [berries, somethingelse]
Name: A, dtype: object
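To illustrate why the non-capturing groups matter, here is a small sketch with plain `re.findall` (which is what `Series.str.findall` applies to each element). With several capturing groups, every match is returned as a tuple; with exactly one, only that group is returned. The second pattern uses `+` rather than `{2,}` so that it differs from the question's pattern only in the group syntax.

```python
import re

text = "Air type:1, Space kind:2, water, wood"

# Three capturing groups (the question's pattern): each match is a 3-tuple.
with_groups = re.findall(r'(^|\s)([^:,]+)(,|$)', text)

# Outer groups made non-capturing: each match is just the middle group.
without_groups = re.findall(r'(?:^|\s)([^:,]+)(?:,|$)', text)

print(with_groups)     # [(' ', 'water', ','), (' ', 'wood', '')]
print(without_groups)  # ['water', 'wood']
```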

Antoine Dubuis


This is quite good, the only issue is that in my actual data I don't have always two occurrences of the : pattern. Thank you though! – User2321 Nov 24 '20 at 07:02
`[^:,]{2,}` means that you will match only strings that have at least 2 consecutive characters that are not `:` or `,`. This ensures that you don't match strings consisting of a single character. – Antoine Dubuis Nov 24 '20 at 07:06
They are not bad, it is just a complex topic ;-). You're welcome, have a nice day – Antoine Dubuis Nov 24 '20 at 07:11
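The effect of the `{2,}` quantifier discussed in these comments can be seen with a short sketch. The input string here is hypothetical, chosen to contain a one-character item; it is not taken from the question's data.

```python
import re

# Hypothetical input with a single-character item ("a").
text = "a, water, b:1"

# `+` accepts items of any length, including one character.
one_or_more = re.findall(r'(?:^|\s)([^:,]+)(?:,|$)', text)

# `{2,}` requires at least two characters, so "a" is skipped.
two_or_more = re.findall(r'(?:^|\s)([^:,]{2,})(?:,|$)', text)

print(one_or_more)  # ['a', 'water']
print(two_or_more)  # ['water']
```

The quantifier limits the length of each matched item; it says nothing about how many `:` items a row contains.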
