
I've run into an issue trying to drop a nan column from a table.

Here's the example that works as expected:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], 
                    columns=['A', 'B', 'C'], 
                    index=['Foo', 'Bar'])

mapping1 = pd.DataFrame([['a', 'x'], ['b', 'y']], 
                        index=['A', 'B'], 
                        columns=['Test', 'Control'])

# rename the columns using the mapping file
df1.columns = mapping1.loc[df1.columns, 'Test']

From here we see that the C column in df1 doesn't have an entry in the mapping file, and so that header is replaced with a nan.
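
A note for readers on newer pandas: as far as I know, since pandas 1.0 `.loc` raises a KeyError when any requested label is missing, so the rename line above would fail on 'C' instead of producing a nan header. A minimal sketch that reproduces the same nan header with `reindex` (the name `new_labels` is only for illustration, not from the original post):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                   columns=['A', 'B', 'C'],
                   index=['Foo', 'Bar'])

mapping1 = pd.DataFrame([['a', 'x'], ['b', 'y']],
                        index=['A', 'B'],
                        columns=['Test', 'Control'])

# reindex fills labels missing from the mapping ('C' here) with NaN
# instead of raising, which mimics the older .loc behaviour
new_labels = mapping1['Test'].reindex(df1.columns)
df1.columns = new_labels.to_numpy()

print(df1.columns.isna().tolist())  # [False, False, True]
```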

# drop the nan column
df1.drop(np.nan, axis=1)

In this situation, passing np.nan to drop finds the final header and drops it.

However, in the situation below, df.drop does not work:

# set up table
sample1 = np.random.randint(0, 10, size=3)
sample2 = np.random.randint(0, 5, size=3)
df2 = pd.DataFrame([sample1, sample2], 
                   index=['sample1', 'sample2'], 
                   columns=range(3))
mapping2 = pd.DataFrame(['foo']*2, index=range(2), 
                        columns=['test'])

# assign columns using mapping file
df2.columns = mapping2.loc[df2.columns, 'test']

# try and drop the nan column
df2.drop(np.nan, axis=1)

And the nan column remains.

El Developer
lkursell

2 Answers


This may be an answer (from https://stackoverflow.com/a/16629125/5717589):

When index is unique, pandas use a hashtable to map key to value. When index is non-unique and sorted, pandas use binary search, when index is random ordered pandas need to check all the keys in the index.

So, if the entries are unique, I think np.nan gets hashed. In the non-unique case, pandas compares values, but:

np.nan == np.nan
Out[1]: False
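
To make the distinction concrete, here is a small sketch in plain Python and NumPy. It only illustrates how equality and hash-based lookup differ for NaN at the Python level; whether pandas' internal hashtable follows the same path is my assumption, not something verified here. The name `lookup` is hypothetical.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Equality comparison never matches a NaN, even against itself
print(nan == nan)    # False

# A dict lookup checks identity before equality, so the same
# NaN object is still found as a key
lookup = {nan: 'found'}
print(lookup[nan])   # found

# pd.isna is the reliable way to test for a NaN label
print(pd.isna(nan))  # True
```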

Update

I guess it's impossible to access a NaN column by label. But it's doable by index position. Here is a workaround for dropping columns with null labels:

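# positions of the columns whose labels are not null
notnull_col_idx = np.arange(len(df.columns))[~pd.isnull(df.columns)]
# select by position, so no label lookup is involved
df = df.iloc[:, notnull_col_idx]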
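
An equivalent sketch using a boolean mask, applied to a table shaped like df2 from the question. The values are fixed and the labels are assigned directly rather than through the mapping file, and the names `keep` and `cleaned` are only for illustration:

```python
import numpy as np
import pandas as pd

# A table like df2 in the question: two 'foo' columns and one NaN column
df2 = pd.DataFrame([[4, 4, 4], [4, 0, 1]], index=['sample1', 'sample2'])
df2.columns = ['foo', 'foo', np.nan]

# Boolean mask that is True for every non-null column label
keep = df2.columns.notna()

# Selecting with a mask needs no label lookup, so duplicate
# labels and NaN labels are not a problem
cleaned = df2.loc[:, keep]

print(cleaned.columns.tolist())  # ['foo', 'foo']
```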
ptrj

Hmmm... this might be considered a bug, but the problem seems to occur when several columns share the same label, in this case foo. If I switch up the labels, the issue disappears:

mapping2 = pd.DataFrame(['foo','boo'], index=range(2), 
                        columns=['test'])

I also attempted to pick the column by its index position and the problem still occurs (note that df2.columns[[2]] still hands the nan label itself to drop, so the lookup is by label again):

# try and drop the nan column
df2.drop(df2.columns[[2]], axis=1)

Out[176]:
test    foo foo nan
sample1 4   4   4
sample2 4   0   1

But after altering the 2nd column label to something other than foo, the problem resolves itself. My best piece of advice is to have unique column labels.
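
A quick way to check for the condition described above before calling drop. This is a minimal sketch that rebuilds a table like df2 with fixed values; `dupes` is a name made up for the example:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([[4, 4, 4], [4, 0, 1]], index=['sample1', 'sample2'])
df2.columns = ['foo', 'foo', np.nan]

# False here, because 'foo' appears twice
print(df2.columns.is_unique)  # False

# Labels that repeat an earlier label
dupes = df2.columns[df2.columns.duplicated()]
print(dupes.tolist())  # ['foo']
```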

Additional info: this also occurs when there are multiple nan columns.

Scratch'N'Purr
  • There are a few ways to deal with it, like renaming the final column to a non-nan and filtering the columns based on notnull(), but using drop is the more direct, and what I wanted to investigate. Glad you also found that index position doesn't work as well – lkursell Apr 29 '16 at 19:11