Keep elements with pattern in pandas series without converting them to list

Question

I have the following dataframe:

df = pd.DataFrame(["Air type:1, Space kind:2, water", "something, Space blu:3, somethingelse"], columns = ['A'])

and I want to create a new column that contains for each row all the elements that have a ":" in them. So for example in the first row I want to return "type:1, kind:2" and for the second row I want "blu:3". I managed by using a list comprehension in the following way:

df['new'] = [[y for y in x  if ":" in y] for x in df['A'].str.split(",")]

But my issue is that the new column contains list elements.

    A                                                       new
0   Air type:1, Space kind:2, water                         [Air type:1, Space kind:2]
1   something at the start:4, Space blu:3, somethingelse    [something at the start:4, Space blu:3]

I have not used Python a lot so I am not 100% whether I am missing a more Pandas specific way to do this. If there is one, more than happy to learn about it and use it. If this is a correct approach how can I convert the elements back into strings in order to do regexes on them? I tried How to concatenate items in a list to a single string? but this is not working as I would like it to.

Ch3steR · Accepted Answer · 2020-11-23T06:37:12.817

2

You can use pd.Series.str.findall here.

df['new'] = df['A'].str.findall('\w+:\w+')

                                 A               new
0            type:1, kind:2, water  [type:1, kind:2]
1  something, blu:3, somethingelse           [blu:3]

EDIT:

When there are multiple words then try

df['new'] = df['A'].str.findall('[^\s,][^:,]+:[^:,]+').str.join(', ')

                                      A                       new
0        Air type:1, Space kind:2, water  Air type:1, Space kind:2
1  something, Space blu:3, somethingelse               Space blu:3

edited Nov 23 '20 at 06:37

answered Nov 22 '20 at 19:18

Ch3steR

20,090
4
28
58

@User2321 Updated the answer. Modified the regex pattern to avoid the extra space but there may be a much better regex pattern than this. – Ch3steR Nov 23 '20 at 06:37
Thank you. Well based on the problem and the specifications the solution works, regex optimization is not a top priority for me :) – User2321 Nov 23 '20 at 06:41

score 1 · Answer 2 · answered Nov 22 '20 at 20:28

1

You can use findall with join:

import pandas as pd
df = pd.DataFrame(["type:1, kind:2, water", "something, blu:3, somethingelse"], columns = ['A'])
df['new'] = df['A'].str.findall(r'[^\s:,]+:[^\s,]+').str.join(', ')
df['new']
# => 0    type:1, kind:2
# => 1             blu:3

The regex matches

[^\s:,]+ - one or more chars other than whitespace, : and ,
: - a colon
[^\s,]+ - one or more chars other than whitespace and ,.

See the regex demo.

The .str.join(', ') concats all the found matches with ,+space.

answered Nov 22 '20 at 20:28

Wiktor Stribiżew

607,720
39
448
563

Thank you! How would you deal with cases where there is more than one word before the :. I have updated my code example (apologies for the confusion I simply forgot to put one of the double word examples there) – User2321 Nov 23 '20 at 06:16
@User2321 Ch3steR built upon my pattern. `r'[^\s:,][^:,]*:[^:,]+'` will do for the time being. – Wiktor Stribiżew Nov 23 '20 at 08:06
Well new day, new question: https://stackoverflow.com/questions/64981401/python-regex-to-pick-all-elements-that-dont-match-pattern – User2321 Nov 24 '20 at 06:41

Keep elements with pattern in pandas series without converting them to list

2 Answers2

Linked