4

I would like to delete rows in which every value lies between 10 and 25, i.e. rows that contain no value less than 10 or greater than 25. My sample dataframe looks like this:

a   b   c  
1   2   3  
4   5   16  
11  24  22  
26  50  65  

Expected Output:

a   b   c  
1   2   3  
4   5   16   
26  50  65  

So if the row contains any value less than 10 or greater than 25, then the row will stay in the dataframe; otherwise, it needs to be dropped.

Is there any way I can achieve this with Pandas instead of iterating through all the rows?

thisisbhavin
Jaswanth Kumar
  • Good luck. I hope you get your rows taken care of. – Mad Physicist Jun 14 '17 at 18:52
  • Just as an FYI, this is not how you ask a question on SO. You should show some minimal effort on your part. If you have the resources to ask a question here, you have the resources to do a little searching yourself. – Mad Physicist Jun 14 '17 at 18:58
  • why are the first and last rows still in your expected output? this doesn't make any sense. – Paul H Jun 14 '17 at 19:02
  • Can you explain your deletion criteria better? Do you drop a row if **any** of the values are less than 10 and greater than 25? or **all** of the values? The example input and output doesn't make sense – Rakesh Adhikesavan Jun 14 '17 at 19:06
  • If the row has any one value which is less than 10 or greater than 25, then the row will stay in the dataframe, else it needs to be dropped. – Jaswanth Kumar Jun 14 '17 at 19:13
  • @PaulH Sorry for the confusion..just updated the question. – Jaswanth Kumar Jun 14 '17 at 19:17
  • It's clarified but the end of the question now contradicts the beginning. – Paul H Jun 14 '17 at 19:36

3 Answers

6

You can call apply and write the result to a new column called 'keep'. You can then use this column to drop the rows you don't need.

import pandas as pd
l = [[1,2,3],[4,5,6],[11,24,22],[26,50,65]]
df = pd.DataFrame(l, columns = ['a','b','c']) #Set up sample dataFrame

df['keep'] = df.apply(lambda row: int(any((x < 10) or (x > 25) for x in row)), axis=1)

any() returns True as soon as one value in the row meets the condition, and False otherwise; wrapping it in int() turns that boolean into the 1 or 0 stored in the 'keep' column.

See the Python documentation for how any() works. apply still iterates over all the rows, much like a for loop, but the code reads more cleanly this way. I cannot think of a way to do this without iterating over all the rows.
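
For example, evaluating the condition on individual rows of the sample data (a quick sketch in plain Python):

row = [11, 24, 22]
any((x < 10) or (x > 25) for x in row)   # False -> keep becomes 0

row = [1, 2, 3]
any((x < 10) or (x > 25) for x in row)   # True -> keep becomes 1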

Output:

    a   b   c  keep
0   1   2   3     1
1   4   5   6     1
2  11  24  22     0
3  26  50  65     1


df = df[df['keep'] == 1] #Drop unwanted rows
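
If you don't want the helper column in the final result, it can be dropped once the filtering is done (a small sketch using pandas' drop(columns=...)):

df = df.drop(columns='keep')   # remove the helper column after filtering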
Rakesh Adhikesavan
  • Great answer! I'm new to pandas and was wondering whether this is more efficient than just looping over the dataframe. I don't know how pandas' dataframe is implemented under the hood, but this seems like iterating over it twice. Could you please point me to some explanation on that? – mimre Jun 14 '17 at 19:56
  • The apply function is just syntactic sugar for a loop. Even if you wrote a for loop to iterate over all the rows, I am not sure there is a way to drop rows on the fly. – Rakesh Adhikesavan Jun 14 '17 at 20:14
5

You can use pandas boolean indexing:

filtered_df = df.loc[((df < 10) | (df > 25)).any(axis=1)]
  • df < 10 returns a boolean DataFrame of the same shape as df
  • | is the element-wise OR operator
  • .any(axis=1) reduces each row to a single True if any element in that row is True (see the pandas documentation)
  • df.loc[...] then keeps only the rows where that boolean Series is True (a short sketch of the intermediate steps follows below)
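
Here is a rough sketch of the intermediate results on the question's sample data (the variable names outside and rows_to_keep are just for illustration):

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 16], [11, 24, 22], [26, 50, 65]],
    columns=['a', 'b', 'c']
)

#boolean DataFrame: True wherever a value falls outside the 10-25 range
outside = (df < 10) | (df > 25)

#row-wise reduction: True if any value in the row is outside the range
rows_to_keep = outside.any(axis=1)
print(rows_to_keep.tolist())   #[True, True, False, True]

print(df.loc[rows_to_keep])
#     a   b   c
# 0   1   2   3
# 1   4   5  16
# 3  26  50  65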
Prageeth Jayathissa
1

I really like using masking for this kind of thing; it keeps the code readable when you come back to it later. It's also faster than .apply, which is effectively a for loop, and it avoids the SettingWithCopyWarning.

This uses boolean indexing like Prageeth's answer, but the difference is that the boolean index is saved as a separate variable for later re-use. I often do that so I don't have to modify the original dataframe or create a new one; I just use df[mask] wherever I want that cropped view of the dataframe.

df = pd.DataFrame(
    [[1,2,3],
    [4,5,16],
    [11,24,22],
    [26,50,65]],
    columns=['a','b','c']
)

#use a mask to create a fully indexed boolean dataframe, 
#which avoids the SettingWithCopyWarning:
#https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
mask = (df > 10) & (df < 25)
print(mask)
"""
       a      b      c
0  False  False  False
1  False  False   True
2   True   True   True
3  False  False  False
"""

print(df[mask])
"""
      a     b     c
0   NaN   NaN   NaN
1   NaN   NaN  16.0
2  11.0  24.0  22.0
3   NaN   NaN   NaN
"""

print(df[mask].dropna())
"""
      a     b     c
2  11.0  24.0  22.0
"""

#one neat thing about masks is that you can invert them with '~'
print(~mask)
"""
       a      b      c
0   True   True   True
1   True   True  False
2  False  False  False
3   True   True   True
"""

print( df[~mask].dropna())
"""
      a     b     c
0   1.0   2.0   3.0
3  26.0  50.0  65.0
"""

#you can also combine masks
mask2 = mask & (df < 24)
print(mask2)
"""
       a      b      c
0  False  False  False
1  False  False   True
2   True  False  False
3  False  False  False
"""

#and the resulting dataframe (shown here without dropping the rows that are all NaN or contain masked-out values)
print(df[mask2])
"""
      a   b     c
0   NaN NaN   NaN
1   NaN NaN  16.0
2  11.0 NaN  22.0
3   NaN NaN   NaN
"""
kilozulu