Appending two pandas DataFrames behaves unexpectedly when a column is entirely null (NaN) in one DataFrame and boolean in the other. The corresponding column in the appended result is typed as float64, and the boolean values are converted to 1.0 and 0.0. Example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=[[1, 2, True], [10, 20, False]], columns=['a', 'b', 'c'])
df1
    a   b     c
0   1   2  True
1  10  20  False 

df2 = pd.DataFrame(data = [[1,2], [10,20]], columns=['a', 'b'])  
df2['c'] = np.nan
df2
    a   b   c
0   1   2 NaN
1  10  20 NaN

Appending:

df1.append(df2)
    a   b    c
0   1   2  1.0
1  10  20  0.0
0   1   2  NaN
1  10  20  NaN

My workaround is to cast the column back to bool, but this turns the NaN values into True:

appended_df = df1.append(df2)
appended_df
    a   b    c
0   1   2  1.0
1  10  20  0.0
0   1   2  NaN
1  10  20  NaN

appended_df['c'] = appended_df.c.astype(bool)
appended_df
    a   b      c
0   1   2   True
1  10  20  False
0   1   2   True
1  10  20   True
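
Why the cast loses the missing values can be seen in isolation. This is a minimal sketch, assuming a standalone Series `c` (a name introduced here for illustration) that holds the same float values as the appended column: NaN is a non-zero float, so a plain bool cast treats it as truthy.

```python
import numpy as np
import pandas as pd

# Stand-in for the float64 column produced by the append above.
c = pd.Series([1.0, 0.0, np.nan, np.nan])

# NaN is a non-zero float, so a plain bool cast maps it to True.
as_bool = c.astype(bool)
print(as_bool.tolist())

# If the missing positions matter, they have to be recorded before the cast.
missing = c.isna()
print(missing.tolist())
```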

Unfortunately, the pandas append documentation doesn't mention this behavior. Any idea why pandas does this?

  • That's because you've specifically declared `.astype(bool)` in your append statement, so it has set all the 1s and 0s to True/False values. – Umar.H Nov 13 '19 at 07:59
  • No, actually I am using `.astype(bool)` only for the workaround; it is irrelevant to the problem. – Anas Alzogbi Nov 13 '19 at 09:48

2 Answers

A DataFrame column has a single dtype, so mixed element types are only possible in an object column; see the discussion "Mixed types of elements in DataFrame's column".

The type of np.nan is float, so all the boolean values are cast to float when appending. To avoid this, you could change the dtype of the 'c' column to 'object' using .astype():

df1['c'] = df1['c'].astype(dtype='object')
df2['c'] = df2['c'].astype(dtype='object')

Then append gives the desired result. However, as stated in the discussion mentioned above, having multiple types in the same column is not recommended. If you use None (the sole NoneType object) instead of np.nan, you don't need to set the dtype yourself. For the difference between NaN (Not a Number) and None, see "What is the difference between NaN and None?"

You should think of what the 'c' column really represents, and choose the dtype accordingly.
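
The object-dtype approach above can be sketched end to end. This is a minimal example rather than a definitive fix: it uses `pd.concat` instead of `append` (an assumption made here, since `DataFrame.append` was removed in pandas 2.0), and the names `df1_obj`, `df2_obj` and `combined` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, True], [10, 20, False]], columns=["a", "b", "c"])
df2 = pd.DataFrame([[1, 2], [10, 20]], columns=["a", "b"])
df2["c"] = np.nan

# Cast 'c' to object in both frames so booleans and NaN can coexist.
df1_obj = df1.astype({"c": "object"})
df2_obj = df2.astype({"c": "object"})

# pd.concat stands in for append here (assumption, see the note above).
combined = pd.concat([df1_obj, df2_obj])
print(combined["c"].dtype)
print(combined["c"].tolist())
```

The trade-off is the one the answer already names: an object column keeps the original values but gives up the speed and type guarantees of a dedicated dtype.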

Oxana Verkholyak

If you are using pandas 1.0.0 or later, you can use convert_dtypes. See the convert_dtypes documentation for a description and usage.

Solution code:

df1 = df1.convert_dtypes()
# append returns a new DataFrame, so keep the result and print that
appended_df = df1.append(df2)

print(appended_df)
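
convert_dtypes switches a bool column to the nullable "boolean" extension dtype introduced in pandas 1.0. A hedged sketch of the same idea follows, assuming both frames are cast to "boolean" explicitly and combined with `pd.concat`. Both choices are assumptions made here and are not part of the answer above, and the name `combined` is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, True], [10, 20, False]], columns=["a", "b", "c"])
df2 = pd.DataFrame([[1, 2], [10, 20]], columns=["a", "b"])
df2["c"] = np.nan

# Use the nullable boolean dtype on both sides so missing values
# are stored as <NA> instead of forcing a float or object column.
df1["c"] = df1["c"].astype("boolean")
df2["c"] = df2["c"].astype("boolean")

combined = pd.concat([df1, df2])
print(combined["c"].dtype)
print(combined["c"].tolist())
```

With matching nullable dtypes on both frames, the combined column can keep True and False alongside missing entries without a manual recast afterwards.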
curiousBrain