I have a .csv file of contact information that I import as a pandas data frame.

>>> import pandas as pd
>>> 
>>> df = pd.read_csv('data.csv')
>>> df.head()

  fName   lName                    email   title
0  John   Smith         jsmith@gmail.com     CEO
1   Joe   Schmo      jschmo@business.com  Bagger
2  Some  Person  some.person@hotmail.com   Clerk

After importing the data, I'd like to drop rows where one field contains one of several substrings in a list. For example:

to_drop = ['Clerk', 'Bagger']

for i in range(len(df)):
    for k in range(len(to_drop)):
        if to_drop[k] in df.title[i]:
            pass  # some code to drop the rows from the data frame

df.to_csv("results.csv")

What is the preferred way to do this in Pandas? Should this even be a post-processing step, or is it preferred to filter this prior to writing to the data frame in the first place? My thought was that this would be easier to manipulate once in a data frame object.

Sidney VanNess

2 Answers

Use `isin` and pass your list of terms to search for; you can then negate the boolean mask using `~`, and this will filter out those rows:

In [6]:

to_drop = ['Clerk', 'Bagger']
df[~df['title'].isin(to_drop)]
Out[6]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO

Another method is to join the terms so it becomes a regex and use the vectorised `str.contains`:

In [8]:

df[~df['title'].str.contains('|'.join(to_drop))]
Out[8]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO
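
One caveat: the joined string is interpreted as a regular expression, so if your search terms could contain regex metacharacters (say, a hypothetical title like 'VP (Sales)'), escape them first. A minimal sketch:

import re

# escape each term so characters like ( or + are matched literally
pattern = '|'.join(map(re.escape, to_drop))
df[~df['title'].str.contains(pattern)]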

IMO it will be easier, and probably faster, to perform the filtering as a post-processing step: if you decide to filter whilst reading, you are iteratively growing the dataframe, which is not efficient.

Alternatively, you can read the csv in chunks, filter out the rows you don't want, and append the filtered chunks to your output csv, as in the sketch below.
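
A minimal sketch of that chunked approach (assuming the data.csv/results.csv names from the question; the chunk size is arbitrary):

import pandas as pd

to_drop = ['Clerk', 'Bagger']
pattern = '|'.join(to_drop)

# stream the csv in chunks, keep only rows whose title matches
# none of the terms, and append each filtered chunk to the output
first = True
for chunk in pd.read_csv('data.csv', chunksize=10000):
    filtered = chunk[~chunk['title'].str.contains(pattern)]
    filtered.to_csv('results.csv', mode='w' if first else 'a',
                    header=first, index=False)
    first = False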

EdChum
  • Maybe using a set would be an idea? – Padraic Cunningham Jul 27 '15 at 21:28
  • @PadraicCunningham you mean for the list of search terms, I guess so but I was hoping the OP is providing a list of unique terms unless you're referring to something else – EdChum Jul 27 '15 at 21:30
  • Yep, meant for the list of names. – Padraic Cunningham Jul 27 '15 at 21:30
  • Minor comment on this, which likely is from the manner in which I worded the question. I'd really like the code to drop anything where a term in to_drop is contained within the text of title. So, for example, if to_drop somehow happened to contain "Bag" it would drop row 1. How to modify to do loose matching? – Sidney VanNess Jul 28 '15 at 01:21
  • I don't understand the second part of the answer: *"Another method is to join the terms so it becomes a regex and use the vectorised `str.contains`"*: that is not what happens using the `join`. – gented Apr 12 '17 at 07:19
  • @GennaroTedesco the result of the `'|'.join` produces the string: `'Clerk|Bagger'` which is a regex pattern which looks for either of the terms 'Clerk' or 'Bagger' – EdChum Apr 12 '17 at 07:29
Another way is using `query`:

In [961]: to_drop = ['Clerk', 'Bagger']

In [962]: df.query('title not in @to_drop')
Out[962]:
  fName  lName             email title
0  John  Smith  jsmith@gmail.com   CEO
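
Note that `not in` here, like `isin`, matches the full title exactly. For the loose substring matching discussed in the comments above, a sketch that should work on more recent pandas versions (assuming a version where string methods are allowed inside `query` with the python engine):

pattern = '|'.join(to_drop)
# drop rows whose title contains any of the search terms
df.query('~title.str.contains(@pattern)', engine='python')
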
Zero