
I have a pandas DataFrame and I want to delete rows from it where the length of the string in a particular column is greater than 2.

I expect to be able to do this (per this answer):

df[(len(df['column name']) < 2)]

but I just get the error:

KeyError: u'no item named False'

What am I doing wrong?

(Note: I know I can use df.dropna() to get rid of rows that contain any NaN, but I didn't see how to remove rows based on a conditional expression.)

cs95
sjs

6 Answers


To directly answer this question's original title, "How to delete rows from a pandas DataFrame based on a conditional expression" (which I understand is not necessarily the OP's problem, but could help other users coming across this question), one way to do this is to use the drop method:

df = df.drop(<some labels>)
df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column 'score' is < 50:

df = df.drop(df[df.score < 50].index)

In-place version (as pointed out in the comments)

df.drop(df[df.score < 50].index, inplace=True)
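
For instance, a minimal self-contained sketch (the DataFrame contents here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'score': [95, 35, 60, 10]})

# df[df.score < 50] selects the rows to delete; .index yields their
# labels, which drop() then removes from the frame.
df = df.drop(df[df.score < 50].index)
print(df)
#   name  score
# 0    a     95
# 2    c     60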

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

To remove all rows where column 'score' is < 50 and > 20:

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
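
As a sketch of the other operators (using the same made-up 'score' column as in the example above), the removal can also be phrased with | and ~:

# remove rows where score is below 20 OR above 50
df = df.drop(df[(df.score < 20) | (df.score > 50)].index)

# equivalently, keep the complement of the mask with ~ (negation)
df = df[~((df.score < 20) | (df.score > 50))]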
User
  • I just want to remark that the drop function supports in-place replacement, i.e. your solution is the same as df.drop(df[df.score < 50].index, inplace=True). Nevertheless, I didn't know the "index" trick. Helped me a lot – Quickbeam2k1 Jun 08 '16 at 10:59
  • Just want to point out that before you use this index trick you need to be sure that your index values are unique (or call `reset_index()`). I found this out the hard way when way too many rows got dropped from my dataframe. – Jay Apr 06 '17 at 21:03
  • How do I drop all rows where the column type is str? I want to only keep list column types. I have tried `test = df.drop(df[df['col1'].dtype == str].index)` but I get the error `KeyError: False`. I have also tried `df.drop(df[df.col1.dtype == str].index)` and `df.drop(df[type(df.cleaned_norm_email) == str].index)` but nothing seems to work. Can anyone advise? Thanks! @User – PyRsquared May 11 '17 at 16:07
  • This is an old question but... @aquatically-challenged-fish's answer is a lot faster than this one. Note that you compute `df[(df.score < 50) & (df.score > 20)]` as part of your answer. If you reversed this to do `df = df[(df.score >= 50) | (df.score <= 20)]` you would get your answer much faster. – Roobie Nuby Mar 26 '18 at 06:44
  • @Jay - Let's say you have a column with non-unique data and you want to use that column as an index. Say I apply reset_index(). What happens to the rows with duplicated column data? – Nguai al Jan 12 '19 at 05:44
  • @RoobieNuby - they are not the same condition. – Nguai al Jan 12 '19 at 05:47
  • @Nguaial No, they are not the same. They are opposite. So instead of dropping, as suggested above, I suggest keeping, as in `df = df[...` rather than `df = df.drop(...` – Roobie Nuby Jan 12 '19 at 19:07
  • I added a time comparison to `drop` to the `df = df[...` answer. I agree it should be accepted instead given the speed. – Max Ghenis Mar 15 '19 at 03:09
  • @RoobieNuby they are not the same conditions, clarification follows: set1 = (1,2,3,4,5,6,7,8); set2 = (3,4,5,6,7,8,9,10), set1 < 5 & set2 > 4 = (4, 5) whereas set1 >= 5 | set2 <=4 = (3,4,5,6,7,8,9,10) – Coddy Jan 24 '20 at 18:11
  • What exactly does '.index' do? – jimmy Jul 28 '20 at 16:22
  • @Guerra the .index will get you a list of the dataframe's index values. In this case it is useful as a parameter to the drop function, so it deletes those specific "rows". – eespejel Sep 06 '20 at 04:49
  • `reset_index(inplace=True)` fixed my issue of rows being deleted for no reason. Thanks, @Jay! – Eli_B Sep 09 '20 at 09:20
  • Can I always reference a column by its name in the `df.columnname` format? – lpnorm Apr 26 '21 at 08:39
  • In my case, I wanted to remove lines from a dataframe based on values from a given column, only if those values happen to be in a certain range (e.g. from 30 to 40), so I used `df.drop(df[df['column-name'].between(30, 40)].index, inplace=True)` – Diving Dec 20 '21 at 03:28
  • What if there are two index values which are the same? It removes both, regardless of whether one is True and one is False. – Madan Apr 07 '22 at 09:56

When you do len(df['column name']) you are just getting one number, namely the number of rows in the DataFrame (i.e., the length of the column itself). If you want to apply len to each element in the column, use df['column name'].map(len). So try:

df[df['column name'].map(len) < 2]
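
A minimal sketch (column name and values invented for illustration):

import pandas as pd

df = pd.DataFrame({'column name': ['ab', 'abcd', 'x', 'xyz']})

# map(len) applies len() element-wise, producing a Series of string
# lengths; comparing it to 2 gives the boolean mask used for filtering.
short = df[df['column name'].map(len) < 2]
print(short)
#   column name
# 2           x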
BrenBarn
  • I came up with a way using a list comprehension: `df[[(len(x) < 2) for x in df['column name']]]` but yours is much nicer. Thanks for your help! – sjs Dec 13 '12 at 04:17
  • In case someone needs a more complex comparison, a lambda can always be used. `df[df['column name'].map(lambda x: str(x)!=".")]` – 4lberto Jun 21 '18 at 15:26
  • For some reason, none of the other options have worked for me, except the one posted by @4lberto. I'm on `pandas 0.23.4` and Python 3.6 – goelakash Aug 25 '18 at 13:22
  • I would add a `.copy()` at the end, in case you want to later edit this dataframe (for example, assigning new columns would raise the "A value is trying to be set on a copy of a slice from a DataFrame" warning). – PlasmaBinturong Mar 23 '19 at 16:38

You can assign the DataFrame to a filtered version of itself:

df = df[df.score > 50]

This is faster than drop:

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test[test.x < 0]
# 54.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test.drop(test[test.x > 0].index, inplace=True)
# 201 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
test = pd.DataFrame({'x': np.random.randn(int(1e6))})
test = test.drop(test[test.x > 0].index)
# 194 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
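
One caveat worth noting (see also the comment below about views): if the original frame is still referenced elsewhere, taking an explicit copy of the filtered result avoids SettingWithCopyWarning when you modify it later, e.g.:

test = test[test.x < 0].copy()  # decouple the result from the original frame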
Max Ghenis
somesingsomsing
  • How do I check multiple columns using an or condition? – Piyush S. Wanare Jul 04 '18 at 14:04
  • https://stackoverflow.com/questions/13611065/efficient-way-to-apply-multiple-filters-to-pandas-dataframe-or-series – somesingsomsing Jul 05 '18 at 15:00
  • Be careful: if there is still a reference to the original unfiltered dataframe around, then the filtered version is a view into (also known as a slice of) the original dataframe, which will cause trouble if the filtered dataframe later needs to be modified (such as getting a column added, for example). In this case one might want to make an explicit copy, as in `df = df[mask].copy()`. Here is a code example showing the issue by raising a warning: `df = pd.DataFrame([(1,0),(2,3)]); df1 = df; df = df[df[0]>1]; df['b'] = 'some value'` – Uwe Mayer Nov 11 '21 at 22:12

I will expand on @User's generic solution to provide a drop-free alternative. This is for folks directed here by the question's title (not the OP's problem).

Say you want to delete all rows with negative values. The one-liner solution is:

df = df[(df > 0).all(axis=1)]

Step-by-step explanation:

Let's generate a 5x5 DataFrame of random normal values:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Let the condition be deleting negatives. A boolean DataFrame satisfying the condition:

df > 0
      A     B      C      D      E
0   True  True   True   True   True
1  False  True  False  False   True
2   True  True   True   True   True
3   True  True  False   True  False
4  False  True   True  False   True

A boolean Series for all rows satisfying the condition. Note that if any element in the row fails the condition, the row is marked False:

(df > 0).all(axis=1)
0     True
1    False
2     True
3    False
4    False
dtype: bool

Finally, filter rows out of the DataFrame based on the condition:

df[(df > 0).all(axis=1)]
      A         B         C         D         E
0  1.764052  0.400157  0.978738  2.240893  1.867558
2  0.144044  1.454274  0.761038  0.121675  0.443863

To actually delete the rows (versus the filtering done above), assign the result back to df:

df = df[(df > 0).all(axis=1)]

This can easily be extended to filter out rows containing NaNs (non-numeric entries):

df = df[(~df.isnull()).all(axis=1)]
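
Since the OP mentioned df.dropna() in the question, note that for this NaN case the mask above should be equivalent to the built-in:

df = df.dropna()                    # drops rows containing any NaN
df = df[df.notnull().all(axis=1)]   # same effect, spelled as a mask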

This can also be simplified for cases like: delete all rows where column E is negative:

df = df[(df.E>0)]

I would like to end with some profiling stats on why @User's drop solution is slower than raw column-based filtration (here dft is presumably a copy of df, since the inplace drop mutates its argument):

%timeit df_new = df[(df.E>0)]
345 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit dft.drop(dft[dft.E < 0].index, inplace=True)
890 µs ± 94.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

A column is basically a Series, i.e. a NumPy array, and it can be indexed without any cost. For folks interested in how the underlying memory organization plays into execution speed, here is a great link on speeding up pandas:
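
As a rough sketch of that point (illustrative only, not a benchmark):

arr = df['E'].to_numpy()   # the column's underlying NumPy array
mask = arr > 0             # vectorized comparison, no label alignment
df_positive_E = df[mask]   # boolean-mask indexing on the frame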

Zakir

In pandas you can use str.len with your boundary, and use the Boolean result to filter:

df[df['column name'].str.len().lt(2)]
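
For example (values invented; .lt(2) is just the method form of < 2):

import pandas as pd

df = pd.DataFrame({'column name': ['ab', 'x', 'xyz']})
print(df[df['column name'].str.len().lt(2)])
#   column name
# 1           x

One nicety of str.len over map(len): on a column containing NaN it yields NaN (which the mask treats as False) rather than raising an error.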
BENY

If you want to drop rows of a DataFrame on the basis of some complicated condition on a column's values, writing it in the ways shown above can get complicated. The following simpler solution always works. Let's assume the column you want to filter on is 'name'; get that column as a list first:

text_data = df['name'].tolist()

Now apply some function to every element of the list and put the result in a pandas Series:

text_length = pd.Series([func(t) for t in text_data])

In my case I was just trying to get the number of tokens:

text_length = pd.Series([len(t.split()) for t in text_data])

Now add the series as an extra column in the DataFrame:

df = df.assign(text_length=text_length.values)

Now we can apply a condition to the new column, such as:

df = df[df.text_length > 10]

Putting it all together as a reusable helper:

def pass_filter(df, label, length, pass_type):
    # count tokens in every value of the chosen column
    text_data = df[label].tolist()
    text_length = pd.Series([len(t.split()) for t in text_data])

    # attach the counts as a temporary column
    df = df.assign(text_length=text_length.values)

    # keep rows above ('high') or below ('low') the threshold
    if pass_type == 'high':
        df = df[df.text_length > length]
    if pass_type == 'low':
        df = df[df.text_length < length]

    # remove the temporary column before returning
    df = df.drop(columns=['text_length'])
    return df
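
A hypothetical usage sketch of the helper above (DataFrame contents invented, pandas assumed imported as pd):

df = pd.DataFrame({'name': ['one two three', 'a b', 'many tokens in this cell here']})
kept = pass_filter(df, 'name', 3, 'high')   # keep rows with more than 3 tokens
print(kept)
#                             name
# 2  many tokens in this cell here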
Oliver