1

I am using python 3.6.

I have a pandas.core.frame.DataFrame and would like to filter the entire DataFrame based on if the column called "Closed Date" is not null. In other words, if it is null in the "Closed Date" column, then remove the whole row from the DataFrame.

My code right now is the following:

data = raw_data.ix[raw_data['Closed Date'].notnull()]

Though it gets the job done, I get an warming message saying the following:

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

I tried this code:

data1 = raw_data.loc[raw_data.notnull(), 'Closed Date']

But get this error:

ValueError: Cannot index with multidimensional key

How do I fix this? Any suggestions?

cool_beans
  • 131
  • 1
  • 5
  • 15

1 Answers1

3

This should work for you:

data1 = raw_data.loc[raw_data['Closed Date'].notnull()]

.ix was very similar to the current .loc (which is why the correct .loc syntax is equivalent to what you were originally doing with .ix). The difference, according to this detailed answer is: "ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index"

Example:

Taking this dataframe as an example (let's call it raw_data):

   Closed Date    x
0          1.0  1.0
1          2.0  2.0
2          3.0  NaN
3          NaN  3.0
4          4.0  4.0

raw_data.notnull() returns this DataFrame:

   Closed Date      x
0         True   True
1         True   True
2         True  False
3        False   True
4         True   True

You can't index using .loc based on a dataframe of boolean values. However, when you do raw_data['Closed Date'].notnull(), you end up with a Series:

0     True
1     True
2     True
3    False
4     True

Which can be passed to .loc as a sort of "boolean filter" to apply onto your dataframe.

Alternate Solution

As pointed out by John Clemens, the same can be achieved with raw_data.dropna(subset=['Closed Date']). The documentation for the .dropna method outlines how this could be more flexible in some situations (for instance, allowing to drop rows or columns in which any or all values are NaN using the how argument, etc...)

sacuL
  • 49,704
  • 8
  • 81
  • 106
  • 1
    You could argue that `raw_data.dropna(subset=['Closed Date'])` is a little more explicit... (and can be extended to do multiple columns and has the `how` option). – Jon Clements Mar 07 '18 at 16:57
  • @JonClements I think that is a good point and addresses the XY-ness of the question. However, I like that this explains what went wrong – piRSquared Mar 07 '18 at 17:00
  • @JonClements True, that would result in the same, and might prove more flexible in some situations; the OP was asking specifically about `.loc` and `.iloc` though – sacuL Mar 07 '18 at 17:00
  • 2
    Indeed - which this does address... not saying it doesn't... Doesn't hurt to realise it's an alternative option that'll work on multiple columns and can operate in place if the OP needed. – Jon Clements Mar 07 '18 at 17:02
  • @sacul BTW, don't be shy to include jc's suggestion in your post. It'll make it more accessible to future readers (with credit of course). – piRSquared Mar 07 '18 at 17:04
  • @piRSquared, good to know, I would have, but I wasn't sure about the etiquette behind that... don't want to step on anyone's toes! – sacuL Mar 07 '18 at 17:08
  • jc posted the comment to be helpful. If we wanted an upvote, he would've posted an answer himself. IMO, I like to incorporate useful comments in my post and I always attribute the edit to the person who helped. My advice is to attempt to make your post useful and never be shy to give credit to anyone who helped make your post more useful. Simple don't be a jerk etiquette will work perfectly (-: – piRSquared Mar 07 '18 at 17:11