Difference between drop and select_dtypes in pandas

Question

What is the difference between these two methods to delete a row if the string 'something' is found in the column 'search'?

First method:

mydata = mydata.set_index("search")
mydata = mydata.drop("something", axis=0)

This method seems pretty straightforward and is understandable.

Second method:

mydata = mydata[~mydata.select_dtypes(['object']).eq('something').any(axis=1)]

I don't really know how this method works. Where in this line is it specified to drop/delete the row? And why does it work with 'object' instead of 'search'? What does the "~" stand for? I just can't find it in the documentation.

I think I got it - more or less. "select_dtypes" searches for all rows with the string 'something' in the column and keeps them. The "~" reverses this statement. — TAN-C-F-OK, Nov 01 '18 at 10:03
No, that's incorrect, `select_dtypes` subsets your dataframe by series/column **type**. The subsequent method `eq` is the one that tests for equality. — jpp, Nov 01 '18 at 10:04

score 1 · Accepted Answer · answered Nov 01 '18 at 10:03

Your two methods are not identical. Let's look at the second method in parts.

Step 1: subset dataframe via `select_dtypes`

`mydata.select_dtypes(['object'])` filters your dataframe for only series with object dtype. You can inspect the dtype of each series via `mydata.dtypes`. Typically, non-numeric series such as strings will have object dtype, which indicates a sequence of pointers to Python objects, similar to a list.

In this case, your two methods only align when `search` is the only object dtype series.
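As a rough illustration of Step 1, here is a minimal sketch on made-up data. Only the column name `search` comes from the question; the `count` and `ratio` columns and all values are hypothetical. The string column is created with an explicit `dtype=object` so the example does not depend on which default string dtype your pandas version uses.

```python
import pandas as pd

# Hypothetical sample data; only the column name "search" is from the question.
mydata = pd.DataFrame({
    "search": pd.Series(["something", "other", "else"], dtype=object),
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
})

# Inspect the dtype of every column.
print(mydata.dtypes)

# Keep only the object-dtype columns; numeric columns are left out.
object_cols = mydata.select_dtypes(include=["object"])
print(list(object_cols.columns))
```

Note that the result is still a dataframe, even though only one column survives the filter.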

Step 2: Test for equality via `eq`

Since Step 1 returns a dataframe, even if it only contains one series, `pd.DataFrame.eq` will return a dataframe of Boolean values of the same shape, with `True` wherever a cell equals 'something'.
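A small sketch of Step 2, again on hypothetical data (the `label` column is invented for illustration), shows the element-wise comparison:

```python
import pandas as pd

# Hypothetical dataframe containing only object-dtype columns,
# standing in for the result of select_dtypes(['object']).
obj = pd.DataFrame({
    "search": pd.Series(["something", "other"], dtype=object),
    "label": pd.Series(["x", "something"], dtype=object),
})

# Element-wise comparison: the result has the same shape as obj,
# with True wherever a cell equals "something".
mask = obj.eq("something")
print(mask)
```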

Step 3: Test for any `True` value row-wise via `any`

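Next, your second method checks whether any value in each row is `True` (`axis=1`), which gives one Boolean per row. The `~` operator then inverts that Boolean series, so indexing with it keeps only the rows in which no object column equals 'something'. Again, if your only object series is `search`, this removes the same rows as your first method.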

If you have multiple object series, then your two methods may not align, as a row may be excluded due to another series being equal to 'something'.
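To make the difference concrete, here is a sketch that runs both methods on the same hypothetical data. The `label` and `count` columns are invented; `label` is a second object column that also contains 'something'.

```python
import pandas as pd

# Hypothetical data: "label" is a second object column containing "something".
mydata = pd.DataFrame({
    "search": pd.Series(["something", "keep", "keep too"], dtype=object),
    "label": pd.Series(["a", "something", "c"], dtype=object),
    "count": [1, 2, 3],
})

# First method: index on "search", then drop the index label "something".
first = mydata.set_index("search").drop("something", axis=0)

# Second method: True for rows where ANY object column equals "something" ...
hits = mydata.select_dtypes(include=["object"]).eq("something").any(axis=1)
# ... then ~ inverts the mask, so Boolean indexing keeps the other rows.
second = mydata[~hits]

print(first.index.tolist())       # ['keep', 'keep too']
print(second["search"].tolist())  # ['keep too']
```

The row with `search == "keep"` survives the first method but not the second, because its `label` value equals 'something'. The first method also leaves `search` as the index, whereas the second keeps it as an ordinary column.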

Thanks. The big missing piece was that 'object' is a data type (str). Others would be 'int', 'float' and 'bool'. And just a little bonus question: 'object' can be mixed type? — TAN-C-F-OK, Nov 01 '18 at 10:14
Yes, see [this excellent answer](https://stackoverflow.com/a/21020411/9209546) to understand what `object` really means. — jpp, Nov 01 '18 at 10:15
