
I know how to delete rows based on simple criteria, like in this Stack Overflow question, but I need to delete rows using more complex criteria.

My situation: I have rows of data where each row has four columns containing numeric codes. I need to drop all rows that don't have at least one code with a leading digit less than 5. I currently have a function that I use with DataFrame.apply to create a new column, 'keep', populated with 1 for each row that should be kept. I then do a second pass that uses that simple 'keep' column to delete the unwanted rows (a sketch of this two-pass approach follows the example data below). What I'm looking for is a way to do this in a single pass, without having to create a new column.

Example Data:

     a    b    c    d
0  145  567  999  876
1  999  876  543  543

In that data, I would like to keep the first row because the leading digit in column 'a' is less than 5. The second row has no column with a leading digit less than 5, so that row needs to be dropped.
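
For reference, here is a minimal sketch of the two-pass approach described above, assuming the codes are stored as strings (the helper column name 'keep' comes from the question; the rest is illustrative):

import pandas as pd

# The example data above, with codes stored as strings (an assumption).
df = pd.DataFrame({'a': ['145', '999'],
                   'b': ['567', '876'],
                   'c': ['999', '543'],
                   'd': ['876', '543']})

# First pass: flag each row to keep with a 1 in a helper column.
df['keep'] = df.apply(
    lambda row: 1 if any(int(v[0]) < 5 for v in row) else 0, axis=1)

# Second pass: keep the flagged rows, then drop the helper column.
df = df[df['keep'] == 1].drop('keep', axis=1)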

Gregory Arenius

1 Answer


This should work:

In [31]:
df[(df.apply(lambda x: x.str[0].astype(int))).lt(5).any(axis=1)]

Out[31]:
     a    b    c    d
0  145  567  999  876

So basically this takes the first character of each column using the vectorised str accessor and casts it to an int. We then call lt, which tests less-than element-wise, to produce a boolean df, and then call any row-wise on that df to produce a boolean mask on the index, which we use to mask the df. Breaking the above down:

In [34]:
df.apply(lambda x: x.str[0].astype(int))

Out[34]:
   a  b  c  d
0  1  5  9  8
1  9  8  5  5

In [35]:    
df.apply(lambda x: x.str[0].astype(int)).lt(5)

Out[35]:
       a      b      c      d
0   True  False  False  False
1  False  False  False  False

In [37]:    
df.apply(lambda x: x.str[0].astype(int)).lt(5).any(axis=1)

Out[37]:
0     True
1    False
dtype: bool

EDIT

To handle NaN values, add a call to dropna inside the lambda:

In [39]:
import io
import pandas as pd

t="""a,b,c,d
0,145,567,999,876
1,999,876,543,543
2,,324,344"""
df = pd.read_csv(io.StringIO(t), dtype=str)
df

Out[39]:
     a    b    c    d
0  145  567  999  876
1  999  876  543  543
2  NaN  324  344  NaN

In [44]:
df[(df.apply(lambda x: x.dropna().str[0].astype(int))).lt(5,axis=0).any(axis=1)]

Out[44]:
     a    b    c    d
0  145  567  999  876
2  NaN  324  344  NaN
EdChum
  • That's awesome. I haven't used any to make a mask in that manner before. That will be helpful to me in a couple of other places as well. Is there a way to handle missing data with this method? It chokes on NaNs. – Gregory Arenius May 21 '15 at 18:38
  • 1
    I'll update, basically you call `dropna` in the lambda – EdChum May 21 '15 at 18:41
  • No it won't see my edit, when you call `dropna()` on a series (which is what we're doing here when calling `apply` on a df) it drops an entry in the series not an entire row – EdChum May 21 '15 at 18:45
  • You're correct, of course, I replied before you had your update up. This answer works perfectly. Thank you for your explanation. – Gregory Arenius May 21 '15 at 18:48
  • The updated function with the dropna works for most cases but fails when all four columns are NaN. The reason is that the returned series has an index that doesn't match that of the original dataframe. To fix it I added a reindex to the line of code (see the sketch below): `df[(df.apply(lambda x: x.dropna().str[0].astype(int))).lt(5, axis=0).any(axis=1).reindex(index=df.index, fill_value=False)]` – Gregory Arenius May 22 '15 at 23:38
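
For illustration, here is a minimal sketch (my own construction, not from the original thread) that reproduces the all-NaN failure mode described in the last comment and applies the reindex fix; the extra all-NaN row is an assumed edge case:

import io
import pandas as pd

# Row 2 is entirely NaN (the assumed edge case from the comment above).
t = """a,b,c,d
0,145,567,999,876
1,999,876,543,543
2,,,,"""
df = pd.read_csv(io.StringIO(t), dtype=str)

# dropna() removes row 2 from every column, so the resulting mask's
# index is [0, 1] while df's index is [0, 1, 2]; using the unaligned
# mask directly to index df fails.
mask = df.apply(lambda x: x.dropna().str[0].astype(int)).lt(5).any(axis=1)

# Reindexing against df.index restores the missing label, filling it
# with False so the fully-NaN row is dropped.
df[mask.reindex(index=df.index, fill_value=False)]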