1

I have the following dataframe:

df = pd.DataFrame(["Air type:1, Space kind:2, water", "something, Space blu:3, somethingelse"], columns = ['A'])

and I want to create a new column that contains for each row all the elements that have a ":" in them. So for example in the first row I want to return "type:1, kind:2" and for the second row I want "blu:3". I managed by using a list comprehension in the following way:

df['new'] = [[y for y in x  if ":" in y] for x in df['A'].str.split(",")]

But my issue is that the new column contains list elements.

    A                                                       new
0   Air type:1, Space kind:2, water                         [Air type:1, Space kind:2]
1   something at the start:4, Space blu:3, somethingelse    [something at the start:4, Space blu:3]

I have not used Python a lot so I am not 100% whether I am missing a more Pandas specific way to do this. If there is one, more than happy to learn about it and use it. If this is a correct approach how can I convert the elements back into strings in order to do regexes on them? I tried How to concatenate items in a list to a single string? but this is not working as I would like it to.

User2321
  • 2,952
  • 23
  • 46

2 Answers2

2

You can use pd.Series.str.findall here.

df['new'] = df['A'].str.findall('\w+:\w+')

                                 A               new
0            type:1, kind:2, water  [type:1, kind:2]
1  something, blu:3, somethingelse           [blu:3]

EDIT:

When there are multiple words then try

df['new'] = df['A'].str.findall('[^\s,][^:,]+:[^:,]+').str.join(', ')

                                      A                       new
0        Air type:1, Space kind:2, water  Air type:1, Space kind:2
1  something, Space blu:3, somethingelse               Space blu:3
Ch3steR
  • 20,090
  • 4
  • 28
  • 58
  • @User2321 Updated the answer. Modified the regex pattern to avoid the extra space but there may be a much better regex pattern than this. – Ch3steR Nov 23 '20 at 06:37
  • Thank you. Well based on the problem and the specifications the solution works, regex optimization is not a top priority for me :) – User2321 Nov 23 '20 at 06:41
1

You can use findall with join:

import pandas as pd
df = pd.DataFrame(["type:1, kind:2, water", "something, blu:3, somethingelse"], columns = ['A'])
df['new'] = df['A'].str.findall(r'[^\s:,]+:[^\s,]+').str.join(', ')
df['new']
# => 0    type:1, kind:2
# => 1             blu:3

The regex matches

  • [^\s:,]+ - one or more chars other than whitespace, : and ,
  • : - a colon
  • [^\s,]+ - one or more chars other than whitespace and ,.

See the regex demo.

The .str.join(', ') concats all the found matches with ,+space.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thank you! How would you deal with cases where there is more than one word before the :. I have updated my code example (apologies for the confusion I simply forgot to put one of the double word examples there) – User2321 Nov 23 '20 at 06:16
  • @User2321 Ch3steR built upon my pattern. `r'[^\s:,][^:,]*:[^:,]+'` will do for the time being. – Wiktor Stribiżew Nov 23 '20 at 08:06
  • Well new day, new question: https://stackoverflow.com/questions/64981401/python-regex-to-pick-all-elements-that-dont-match-pattern – User2321 Nov 24 '20 at 06:41