2

I am new to using Pandas dataframes but have used Spark's dataframes a lot. Consider the following dataframe.

Name   Value   Title
mickey 20      wonderland
donald 10      welcome to donald's castle
minnie 86      Minnie mouse clubhouse

I want to retain only those rows where the "Name" is contained within "Title" ignoring case. So, in this case, the filtered dataframe should look like

Name   Value   Title
donald 10      welcome to donald's castle
minnie 86      Minnie mouse clubhouse

The row with Name = mickey was dropped.

In spark, I can create a dataframe df and then say df.filter($'Title'.lower().contains($'Name'.lower()))

Is there a simple way of expressing that in Pandas dataframes?

cs95
  • 379,657
  • 97
  • 704
  • 746
Nik
  • 5,515
  • 14
  • 49
  • 75

4 Answers4

3

String methods in pandas are inherently difficult to vectorize. I usually do this using a list comprehension:

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name  Value                       Title
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

Most string methods can be sped up using list comprehensions are long are you aren't worried about NaNs and mixed types. See For loops with pandas - When should I care?.


If you need error handling, use a function with try-except handling. This is still faster.

def try_check(x, y):
    try:
        return y.lower() in x.lower()
    except AttributeError: 
        return False

df[[try_check(x, y) for x, y in zip(df['Title'], df['Name'])]]

     Name  Value                       Title
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Wow! Pandas is seriously limited in terms of expressibility. Coming from Spark world, Pandas DF manipulations look almost ugly. Thanks for the solution though. It looks good. Will accept the solution. – Nik Jan 29 '19 at 23:13
  • 2
    @Nik pandas is not 'ugly' : -) , we just need to follow the right logic behind it – BENY Jan 29 '19 at 23:16
2

Using numpy.core.chararray

s1=df.Title.str.upper().values.astype(str)
s2=df.Name.str.upper().values.astype(str)
df[np.core.chararray.find(s1,s2)!=-1]
Out[790]: 
     Name  Value                       Title
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Heres another solution but it uses the .apply method, not sure how that holds in terms of speed, but this works and its pretty expressible.

df[df.apply(lambda x: x.Name.upper() in x.Title.upper(), axis=1) == True]


# Output
    Name    Value Title
1   donald  10    welcome to donald's castle
2   minnie  86    Minnie mouse clubhouse
Erfan
  • 40,971
  • 8
  • 66
  • 78
1

Here are a few more options, all based on this SO post (main requirement is to use "|".join(...))

Option 1 - df.query()

df_match = df.query("Title.str.lower().str.contains('|'.join(Name.str.lower()))")
print(df_match)
     Name  Value                       Title
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

option 2

print(df[df['Title'].str.lower().str.contains('|'.join(df['Name'].str.lower()))])
     Name  Value                       Title
1  donald     10  welcome to donald's castle
2  minnie     86      Minnie mouse clubhouse

option 3 - using NumPy where

from numpy import where
df['match'] = (
                where(df.Title.str.lower().str.contains(
                    '|'.join(df['Name'].str.lower()))
                , True, False)
                )
print(df[df['match']==True])
     Name  Value                       Title  match
1  donald     10  welcome to donald's castle   True
2  minnie     86      Minnie mouse clubhouse   True
edesz
  • 11,756
  • 22
  • 75
  • 123