How to select rows in Pandas dataframe based on string matching in multiple columns

Question

I don't think this exact question has been answered yet, so here goes.

I have a Pandas data frame, and I want to select all rows that contain a string in column A or column B.

Say the dataframe looks like this:

d = {'id':["1", "2", "3", "4"], 
     'title': ["Horses are good", "Cats are bad", "Frogs are nice", "Turkeys are the best"], 
     'description':["Horse epitome", "Cats bad but horses good", "Frog fancier", "Turkey tome, not about horses"],
     'tags':["horse, cat, frog, turkey", "horse, cat, frog, turkey", "horse, cat, frog, turkey", "horse, cat, frog, turkey"],
     'date':["2019-01-01", "2019-10-01", "2018-08-14", "2016-11-29"]}

dataframe  = pandas.DataFrame(d)

Which gives:

id              title                      description               tag           date
1   "Horses are good"                  "Horse epitome"       "horse, cat"    2019-01-01
2      "Cats are bad"                       "Cats bad"       "horse, cat"    2019-10-01
3    "Frogs are nice"      "Frog fancier, horses good"      "horse, frog"    2018-08-14
4   "Turkey are best"                    "Turkey tome"    "turkey, horse"    2016-11-29

Let's say I want to create a new dataframe containing rows with the string horse (ignoring capitalisation) in the column title OR the column description, but not in the column tag (or any other column).

The result should be (row 2 and 4 get dropped):

id                title                     description                 tag          date  
1     "Horses are good"                  "Horse epitome"       "horse, cat"    2019-01-01
3      "Frogs are nice"      "Frog fancier, horses good"      "horse, frog"    2018-08-14

I have seen a few answers for one column, such as something like:

dataframe[dataframe['title'].str.contains('horse')]

But I am not sure (1) how to add multiple columns to this statement and (2) how to modify it with something like string.lower() to remove capitals in the column values for the string match.

Thanks in advance!

jezrael · Accepted Answer · 2019-10-25T13:58:05.567

If want specify columns for test one possible solution is join all columns and then test with Series.str.contains and case=False:

s = dataframe['title'] + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)]

Or create conditions for each column and chain them by bitwise OR with |:

df = dataframe[dataframe['title'].str.contains('horse', case=False) | 
               dataframe['description'].str.contains('horse', case=False)]

Also if want specify column column for not test chain solution with bitwise AND with invert condition by ~ for NOT MATCH:

df = dataframe[s.str.contains('horse', case=False) &
               ~dataframe['tags'].str.contains('horse', case=False)]

For second solution add () around all columns with chained by OR:

df = dataframe[(dataframe['title'].str.contains('horse', case=False) | 
               dataframe['description'].str.contains('horse', case=False)) &
              ~dataframe['tags'].str.contains('horse', case=False)]]

EDIT:

Like @WeNYoBen commented you can add DataFrame.copy to end for prevent SettingWithCopyWarning like:

s = dataframe['title'] + dataframe['description']
df = dataframe[s.str.contains('horse', case=False)].copy()

score 1 · Answer 2 · answered Oct 25 '19 at 13:49

You can use a "logical or" operator | on the series corresponding to each column:

filtered = df[df['title'].str.contains('horse', case=False) | 
              df['description'].str.contains('horse', case=False)]

If you have many columns, you could use a reduce operation:

import functools
import operator

colnames = ['title', 'description']
mask = functools.reduce(operator.or_, (df[col].str.contains('horse', case=False) for col in colnames))
filtered = df[mask]

Please remove first solution, because in my answer. – jezrael Oct 25 '19 at 13:49 — jezrael, Oct 25 '19 at 13:49

How to select rows in Pandas dataframe based on string matching in multiple columns

2 Answers2