Given a simple Pandas Series that contains some strings which can consist of more than one sentence:

In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])

Out:
0    This is a long text. It has multiple sentences.
1                Do you see? More than one sentence!
2             This one has only one sentence though.
dtype: object

I use the pandas string method split with a regex pattern to split each row into its individual sentences. This produces unnecessary empty list elements; any suggestions on how to improve the regex?

In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')

Out:
0    [, This is a long text.,  , It has multiple se...
1        [, Do you see?,  , More than one sentence!, ]
2         [, This one has only one sentence though., ]
dtype: object

This converts each row into a list of strings, with each sentence held in its own element.
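On the side question about the empty list elements, one possible sketch (an assumption, not something tested in the original post) is to collect the sentences with `str.findall` instead of `split`. `findall` returns only the matches, so the empty strings and whitespace separators never appear. The names `s_raw` and `sentences` are illustrative, and the pattern remains naive about abbreviations and sentences that do not start with a capital letter.

```python
import pandas as pd

# s_raw is a hypothetical name for the Series before any splitting.
s_raw = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall keeps only the matched sentences; no capture group is needed.
# Inside a character class, '.' does not need a backslash.
sentences = s_raw.str.findall(r'[A-Z][^.!?]*[.!?]')
print(sentences.tolist())
```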

Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series that stores the resulting boolean values, each indicating whether the regex matched at least one of the list elements.

I would expect something like:

In:
s.str.contains('you')

Out:
0   False
1   True
2   False

<-- Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.

However, the above actually returns:

0   NaN
1   NaN
2   NaN
dtype: float64

I also tried a list comprehension which does not work:

result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'

Any suggestions on how this can be achieved?

Dirk

1 Answer

You can use Python's str.find() method, which returns -1 when the substring is not found:

>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0    False
1     True
2    False
dtype: bool

I guess s.str.contains('you') is not working because the elements of your Series are not strings, but lists. But you can also do something like this:

>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0    False
1     True
2    False
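The guess above about the element types can be checked directly. This is a minimal sketch, assuming the same data as in the question; `s_raw` and `element_types` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# s_raw is a hypothetical name for the Series before splitting.
s_raw = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])
s = s_raw.str.split(r'([A-Z][^\.!?]*[\.!?])')

# After the split, every element is a Python list rather than a str,
# which is why s.str.contains(...) has nothing string-like to search.
element_types = [type(x) for x in s]
print(element_types)

# On the unsplit Series the elements are strings, so str.contains applies.
print(s_raw.str.contains('you').tolist())
```

For a pattern that cannot span a sentence boundary, checking the unsplit strings may already be enough.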
Roman Pekar
  • Hooray, thanks! I prefer the latter one in my case though because `contains` lets you search with a regex while `find` expects a string. However, in simple cases, when no regex is needed, find will probably be faster I guess. – Dirk Dec 04 '14 at 17:53
  • @Dirk, just in case - you can use the `re` module to search by regexp - https://docs.python.org/2/library/re.html#module-re – Roman Pekar Dec 04 '14 at 18:05
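Following the comment above, here is a hedged sketch of the `re`-based variant: compile the pattern once and test each list element with `search`. The pattern `r'\byou\b'` and the names `pattern` and `has_match` are illustrative assumptions, not taken from the thread.

```python
import re

import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
]).str.split(r'([A-Z][^\.!?]*[\.!?])')

# Illustrative regex: the word "you" with word boundaries on both sides.
pattern = re.compile(r'\byou\b')

def has_match(parts):
    """Return True if the regex matches at least one element of the list."""
    return any(pattern.search(part) for part in parts)

result = s.apply(has_match)
print(result.tolist())
```

This keeps the regex support of `contains` while avoiding the construction of a temporary Series for every row.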