Consider this simple setup:

import pandas as pd

df = pd.DataFrame({'id' : [1,2,3],
                   'text' : ['stack-overflow',
                             'slack-overflow',
                             'smack-over']})
df
Out[9]: 
   id            text
0   1  stack-overflow
1   2  slack-overflow
2   3      smack-over

I have a given regex, and I would like to extract the longest match. I know I can use str.extractall to get all the matches, but how can I get the longest one efficiently (as a column df['mylongest'] in the dataframe)?

Of course, in this example the longest matches are overflow, overflow and smack.

df.text.str.findall(r'(\w+)')
Out[10]: 
0    [stack, overflow]
1    [slack, overflow]
2        [smack, over]
Name: text, dtype: object
ℕʘʘḆḽḘ

2 Answers

Let's map max to the result of str.findall. I use functools.partial to avoid lambdas.

from functools import partial

f = partial(max, key=len)
df['text'].str.findall(r'(\w+)').map(f)

0    overflow
1    overflow
2       smack
Name: text, dtype: object
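The question asks for the result as a column `df['mylongest']`. A minimal sketch of that, assuming some rows might contain no match at all: the fourth row (`'---'`) and the helper name `longest` are hypothetical additions for illustration, not part of the original data. `max` raises `ValueError` on an empty list, so the sketch passes `default=None`.

```python
import pandas as pd

# The fourth row is a hypothetical example with no \w+ match at all.
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['stack-overflow',
                            'slack-overflow',
                            'smack-over',
                            '---']})

def longest(matches):
    # max() raises ValueError on an empty list, so fall back to None.
    # On ties, max() keeps the first of the equally long matches.
    return max(matches, key=len, default=None)

df['mylongest'] = df['text'].str.findall(r'\w+').map(longest)
```

Rows without any match end up with a missing value in `mylongest` instead of raising an error.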
cs95
  • Thanks, but I really don't understand the `partial` function. Is that native pandas stuff? Can we do that with a good ol' lambda func? Thanks!! – ℕʘʘḆḽḘ Apr 07 '19 at 23:08
  • @ℕʘʘḆḽḘ The alternative would be `df['text'].str.findall(r'(\w+)').map(lambda x: max(x, key=len))`. It's in the standard library, nothing to do with pandas. – cs95 Apr 07 '19 at 23:10
  • @ℕʘʘḆḽḘ I think you will find [Python: Why is functools.partial necessary?](https://stackoverflow.com/questions/3252228) interesting. – Wiktor Stribiżew Apr 08 '19 at 07:55
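To illustrate the point made in the comments, here is a small sketch showing that `functools.partial(max, key=len)` and the lambda behave the same way. The names `f`, `g` and `words` are only for illustration.

```python
from functools import partial

# partial() pre-fills the key argument, so f(x) is max(x, key=len).
f = partial(max, key=len)

# The lambda spells out the same call explicitly.
g = lambda x: max(x, key=len)

words = ['stack', 'overflow']
longest_via_partial = f(words)
longest_via_lambda = g(words)
```

Both callables can be passed to `Series.map` interchangeably; which one to use is mostly a matter of taste.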

If you would like to try something in pandas:

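# One row per match, indexed by (original row, match number)
s = df.text.str.extractall(r'(\w+)')[0]
# Series.max(level=0) was removed in pandas 2.0; groupby(level=0).max() is the equivalent
s[s.str.len().eq(s.str.len().groupby(level=0).max(), level=0)]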
Out[51]: 
   match
0  1        overflow
1  1        overflow
2  0           smack
Name: 0, dtype: object
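The result above keeps the `(row, match)` MultiIndex and would return several rows for a tie. A possible way to turn it into the `df['mylongest']` column from the question, sketched here with `groupby(level=0).transform('max')` rather than the `eq(..., level=0)` call used above; the names `lengths` and `longest` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['stack-overflow',
                            'slack-overflow',
                            'smack-over']})

# One row per match, indexed by (original row, match number).
s = df['text'].str.extractall(r'(\w+)')[0]
lengths = s.str.len()

# Keep the matches whose length equals the per-row maximum.
longest = s[lengths == lengths.groupby(level=0).transform('max')]

# Collapse ties to the first match per row, leaving a plain row index.
longest = longest.groupby(level=0).first()

# Assignment aligns on the row index; rows with no match would get NaN.
df['mylongest'] = longest
```

This stays within pandas, at the cost of building the intermediate long-format Series of all matches.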
BENY