
I asked a similar question yesterday ("Keep elements with pattern in pandas series without converting them to list") and now I am faced with the opposite problem.

I have a pandas dataframe:

import pandas as pd
df = pd.DataFrame(["Air type:1, Space kind:2, water, wood", "berries, something at the start:4, Space blu:3, somethingelse"], columns = ['A'])

and I want to pick all elements that don't have a ":" in them. What I tried is the following regex which seems to be working:

df['new'] = df.A.str.findall(r'(^|\s)([^:,]+)(,|$)')
    A                                                               new
0   Air type:1, Space kind:2, water, wood                           [( , water, ,), ( , wood, )]
1   berries, something at the start:4, Space blu:3, somethingelse   [(, berries, ,), ( , somethingelse, )]

If I understand this correctly, findall looked for the 3 groups (the ones I have in parentheses) and returned every match as a tuple of those groups, wrapped in a list. Is there a way to avoid this and return only the middle group? That is, for the first row: water, wood, and for the second row: berries, somethingelse.

I also tried the opposite approach:

df.A.str.replace(r'[^\s,][^:,]+:[^:,]+', '', regex=True).str.replace(r'\s*,', '', regex=True)

which comes close to what I want (only the commas between the remaining elements are missing), but I am wondering if I am missing something that would make my life easier.

User2321

2 Answers


You may use this regex code:

>>> df['new'] = df.A.str.findall(r'(?:^|,)([^:,]+)(?=,|$)')
>>> print (df)
                                                   A                        new
0              Air type:1, Space kind:2, water, wood            [ water,  wood]
1  berries, something at the start:4, Space blu:3...  [berries,  somethingelse]

Regex used is:

  • (?:^|,): Match start or comma
  • ([^:,]+): Match 1+ of any character that is not a : and not a ,
  • (?=,|$): Lookahead to assert that we have either a , or end of line ahead
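
As a minimal sketch of how this pattern behaves outside pandas, the snippet below applies it with the standard-library `re.findall` (which `Series.str.findall` applies to each element) and then strips the leading space that the comma-anchored match keeps. The helper name `items_without_colon` is hypothetical and used only for illustration.

```python
import re

# Same pattern as above: a single capturing group, so findall returns plain strings.
PATTERN = re.compile(r'(?:^|,)([^:,]+)(?=,|$)')

def items_without_colon(text):
    """Return the comma-separated items of `text` that contain no ':' (hypothetical helper)."""
    # The capture starts right after the comma, so items keep a leading space;
    # strip() removes it.
    return [item.strip() for item in PATTERN.findall(text)]

rows = [
    "Air type:1, Space kind:2, water, wood",
    "berries, something at the start:4, Space blu:3, somethingelse",
]
result = [items_without_colon(row) for row in rows]
print(result)
# [['water', 'wood'], ['berries', 'somethingelse']]
```

In pandas, the same cleanup could probably be written as `df.A.str.findall(...).apply(lambda xs: [x.strip() for x in xs])`.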
anubhava
  • One may also use `df.A.str.findall(r'(?:^|,)([^:,]+)(?=,|$)').apply(','.join)` to get comma delimited list of matched substrings – anubhava Nov 24 '20 at 07:04

You can use the following regex, which uses non-capturing groups (?:...):

df.A.str.findall(r'(?:^|\s)([^:,]{2,})(?:,|$)')

This returns the following output:

0               [water, wood]
1    [berries, somethingelse]
Name: A, dtype: object
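
To illustrate why the non-capturing groups matter, here is a small sketch using the standard-library `re.findall`. It keeps the `+` quantifier from the question, so the only difference between the two patterns is the group type.

```python
import re

text = "Air type:1, Space kind:2, water, wood"

# Three capturing groups: findall returns one tuple per match.
with_groups = re.findall(r'(^|\s)([^:,]+)(,|$)', text)

# Outer groups made non-capturing: only the middle group is returned.
middle_only = re.findall(r'(?:^|\s)([^:,]+)(?:,|$)', text)

print(with_groups)  # [(' ', 'water', ','), (' ', 'wood', '')]
print(middle_only)  # ['water', 'wood']
```

With exactly one capturing group left, `findall` returns the text of that group directly instead of tuples.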
Antoine Dubuis
  • This is quite good, the only issue is that in my actual data I don't always have two occurrences of the : pattern. Thank you though! – User2321 Nov 24 '20 at 07:02
  • `[^:,]{2,}` means that you will match only strings that have at least 2 consecutive characters that are not `:` or `,`. This will ensure that you don't match strings with a single letter. – Antoine Dubuis Nov 24 '20 at 07:06
  • They are not bad, it is just a complex topic ;-). You're welcome, have a nice day – Antoine Dubuis Nov 24 '20 at 07:11
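
The effect of `{2,}` described in the comments can be checked with a short sketch. The sample string is made up for illustration and is not from the question's data.

```python
import re

# Hypothetical input with single-character items, not taken from the question.
text = "a, bb, c:1, ddd"

# '+' accepts items of any length (1 or more characters).
at_least_one = re.findall(r'(?:^|\s)([^:,]+)(?:,|$)', text)

# '{2,}' requires at least 2 characters, so the single-letter item 'a' is dropped.
at_least_two = re.findall(r'(?:^|\s)([^:,]{2,})(?:,|$)', text)

print(at_least_one)  # ['a', 'bb', 'ddd']
print(at_least_two)  # ['bb', 'ddd']
```

Note that `{2,}` constrains the length of each matched item, not the number of `:` patterns in the row.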