
I have a Pandas DataFrame whose columns are labeled with Python tuples.

These column labeling tuples can have None in them.

When I add columns to a DataFrame using either of the following approaches, each None in the label tuple is implicitly converted to numpy.nan.

Approach 1 - Add columns with the dataframe[ NewColumn ] = ... syntax

>>> import pandas
>>> df = pandas.DataFrame()
>>> column_label = ( 'foo', None )
>>> df[column_label] = [ 1, 2, 3 ]
>>> df
   (foo, nan)
0           1
1           2
2           3
>>> 
>>> df.columns
Index([(u'foo', nan)], dtype='object')
                ^^^
             Desired to be None

Approach 2 - Add a column with pandas.DataFrame.insert

>>> import pandas
>>> df = pandas.DataFrame()
>>> df.insert( 0, ( 'foo', None ), [ 1, 2, 3 ] )
>>> df
   (foo, nan)
0           1
1           2
2           3
>>> df.columns
Index([(u'foo', nan)], dtype='object')
                ^^^
             Desired to be None

So - what is going on here?

Is there a way to add a column to an existing DataFrame, using either the DataFrame[] or DataFrame.insert syntax, when its label is a tuple containing None?

(Curiously, if you pass tuple column labels containing None directly to the DataFrame constructor, or explicitly set the columns attribute to tuples containing None, the None is retained, e.g.:

df = pandas.DataFrame( [ 1, 2, 3 ], columns=[ ( 'foo', None )] )

gives a DataFrame with ( 'foo', None ) as a column, not ( 'foo', nan ).

Similarly doing: df.columns = [ ( 'foo', None ), ... ]

will set the first column label to ( 'foo', None ) ).

deadcode

1 Answer


DataFrame columns and rows are different. Columns are accessed by header name, so without more context it may not make sense to use None in a header; see how the 'foo' column is accessed below. There is also an optional index. If the index is left out, it defaults to consecutive integers.

import pandas
headers = ['foo', 'Nada']
foo = [(1,'uno'), (2,'dos'), (3, 'tres')]
indices = ['a', 'b', 'c']
df = pandas.DataFrame(foo, columns=headers, index=indices)
#     foo  Nada
# a    1   uno
# b    2   dos
# c    3  tres

df['foo'] # only the foo column of the DataFrame (indices are also shown)
# a    1
# b    2
# c    3
# Name: foo, dtype: int64

df.loc['b'] # the row at b
# foo       2
# Nada    dos
# Name: b, dtype: object

df.iloc[0] # the row at integer location 0
# foo       1
# Nada    uno
# Name: a, dtype: object

bar = ['one', 'two', 'three']
df['bar'] = bar # add a new column
#    foo  Nada    bar
# a    1   uno    one
# b    2   dos    two
# c    3  tres  three

Headers that contain tuples with None could prove buggy and difficult to use with Pandas. One approach is to serialize the tuples into strings, or tuples of strings, for use as headers, using a conditional list comprehension as described in "if/else in Python's list comprehension?". If the original tuples are needed later, they can be deserialized from the headers.

column_label = ( 'foo', None )
headers = ['' if x is None else x for x in column_label] # serialize: replace each None with ''
df = pandas.DataFrame(foo, columns=headers)
#    foo
# 0    1   uno
# 1    2   dos
# 2    3  tres
column_labels_were = tuple([x if x else None for x in df.columns]) # deserialize: '' is falsy, so it maps back to None
# ('foo', None)
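
One caveat with the empty-string approach: it cannot tell a real '' apart from None, and the truthiness test also turns other falsy parts such as 0 into None. A minimal sketch of a stricter round trip, keeping each label as a tuple, might look like the following. NONE_TOKEN, encode_label and decode_label are hypothetical names chosen for illustration, not part of pandas, and the sketch assumes the sentinel string never occurs in a real label.

```python
# Hypothetical sentinel standing in for None inside a label.
# Assumption: this string never appears as a genuine label part.
NONE_TOKEN = "<None>"

def encode_label(label):
    """Replace each None in a tuple label with the sentinel string."""
    return tuple(NONE_TOKEN if part is None else part for part in label)

def decode_label(label):
    """Restore None wherever the sentinel string appears."""
    return tuple(None if part == NONE_TOKEN else part for part in label)

original = ("foo", None)
encoded = encode_label(original)   # safe to hand to pandas as a label
restored = decode_label(encoded)   # recovers the original tuple

# Falsy parts other than None survive the round trip unchanged.
tricky = ("", 0, None)
tricky_restored = decode_label(encode_label(tricky))
```

The encoded tuples contain no None, so whatever conversion pandas applies to None in labels should not come into play; decoding is only needed when the original labels are read back.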
SpeedCoder5
  • I think everything you are saying is factually correct - but I have two requirements: 1. Label columns with a tuple (this appears to be allowed) 2. Have the column label tuples include None (this appears to be allowed, but various common syntaxes for adding columns are implicitly changing the labels). – deadcode Jan 04 '16 at 23:03
  • Updated answer with a method for serializing headers to strings. – SpeedCoder5 Jan 05 '16 at 16:20