38

When you merge two indexed dataframes on certain values with an 'outer' merge, pandas automatically fills the fields it could not match with Null (NaN) values. This is normal behaviour, but it changes the data type of the affected columns, and you then have to restate what data types they should have.
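
A tiny made-up example of what I mean:

import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [10, 20]})

merged = left.join(right, how='outer')
merged.dtypes
#a      int64
#b    float64   <- 'b' was int64 before the merge; the unmatched row became NaN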

Neither fillna() nor dropna() seems to preserve the data types immediately after the merge. Do I need a table structure in place?

Typically I would run numpy's np.where(field.isnull(), ...) to patch things up, but that means running it for every column.

Is there a workaround to this?

Jeff
  • 551
  • 1
  • 5
  • 19
  • I think some example would help clarify what you want to achieve. Sometimes you can't change a type back, for example from float to int, because an int column can't contain NaN. And if all NaNs are immediately dropped, then why to use 'outer'? – ptrj Apr 20 '16 at 13:41

4 Answers

30

This should really only be an issue with bool or int dtypes. float, object and datetime64[ns] can already hold NaN or NaT without changing the type.

Because of this, I'd recommend using the nullable dtypes: Int64 for your integer columns and 'boolean' for your Boolean columns. Both of these support missing values, shown as <NA> (pandas._libs.missing.NAType).

import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1, 2], 'e': [True, False]})

df2['d'] = df2['d'].astype('Int64')
df2['e'] = df2['e'].astype('boolean')
df2.dtypes
#d      Int64
#e    boolean
#dtype: object

df.join(df2)
#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>

df.join(df2).dtypes
#a      int64
#b      int64
#c      int64
#d      Int64    <- dtype preserved
#e    boolean    <- dtype preserved

With Int64/boolean the fill value remains true to what you specify, and the column is only upcast if you fill with a value that cannot be represented in the current dtype.
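
For instance, a quick sketch of that fill behaviour on a standalone Series:

s = pd.Series([1, 2, None], dtype='Int64')
s.fillna(-1)
#0     1
#1     2
#2    -1
#dtype: Int64    <- integer fill value, nullable dtype preserved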

ALollz
  • 57,915
  • 7
  • 66
  • 89
  • I think you missed the point of `fillna`. You'll never replace that with a real value from your data like the float in your example. The value is just a sentinel so you can know which values were filled in. – hume Dec 05 '19 at 05:47
  • 2
    @hume I disagree, `fillna` is peripheral to the problem. The OP had an issue where `int` columns turned to `float` because of missing values, as `NaN` is a float and `pandas` then upcasts the column. IMO, the sloppy solution is to use some `fillna` with a sentinel value (-999); however that is highly error prone, especially when doing math like `sum` or `mean`. With `Int64` there's no need for a magic integer that indicates the value is null because the `Int64` column stores the numbers as integers **and** can store `NaN`, thus preserving the datatype after the merge. That was the question. – ALollz Dec 05 '19 at 10:57
  • You said in the solution that "in the other solutions if you do .fillna(-1.72) you may get an unwanted answer as you call int(-1.72)," but that will never happen. – hume Dec 05 '19 at 16:49
  • I think we're talking past each other. I was suggesting an edit that the concern about data loss going from float to int in the section I quoted will not happen. Int64 is a good solution if your data is bool/int as you mention. – hume Dec 05 '19 at 17:30
  • 1
    Can you comment why you first cast to `int` and then to `Int64` and not directly to `Int64`? – Stefan Feb 05 '21 at 13:07
  • 1
    @Stefan good catch. This solution is a bit old and at the time Int64 was just introduced and still experimental. IIRC you couldn't directly cast True - > 1 as an Int64 because it threw an error so the astype('int').astype('Int64') made it True -> 1 as an int -> 1 as Int64, but it seems that they've since added that functionality. In fact, there wasn't even a nullable Boolean type at the time, so I've updated the solution so that you can properly handle Bool columns today. – ALollz Feb 05 '21 at 15:46
7

I don't think there's any really elegant/efficient way to do it. You could do it by tracking the original datatypes and then casting the columns after the merge, like this:

import pandas as pd

# all types are originally ints
df = pd.DataFrame({'a': [1]*10, 'b': [1, 2] * 5, 'c': range(10)})
df2 = pd.DataFrame({'e': [1, 1], 'd': [1, 2]})

# track the original dtypes
orig = df.dtypes.to_dict()
orig.update(df2.dtypes.to_dict())

# join the dataframe
joined = df.join(df2, how='outer')

# columns with nans are now float dtype
print(joined.dtypes)

# replace nans with suitable int value
joined.fillna(-1, inplace=True)

# re-cast the columns as their original dtype
joined_orig_types = joined.apply(lambda x: x.astype(orig[x.name]))

print(joined_orig_types.dtypes)
hume
  • 2,413
  • 19
  • 21
  • With big enough integer values, precision loss is a concern that is not addressed by this solution. – masaers Oct 22 '19 at 08:58
  • Not super relevant. Only happens if `pandas` coerced those values into a `float` for you when it created the `NaN`s and your integers were huge. Casting as the nullable integer types ahead of time will handle this, but you still need to know and anticipate that. https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#integer-na – hume Dec 05 '19 at 05:52
  • I think what you mean is "Not likely to be relevant", as it does indeed get "super relevant" the second you hit the, granted not very frequent, situation where it kicks in. – masaers Dec 09 '19 at 03:47
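
To make the precision concern above concrete, a small standalone sketch: integers beyond 2**53 are not exactly representable as float64, so the round trip through float can silently change the value.

import pandas as pd

big = 2**53 + 1                                  # 9007199254740993
s = pd.Series([big], dtype='int64')
print(s.astype('float64').astype('int64')[0] == big)
#False   <- the value changed silently during the float round trip
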
6

As of pandas 1.0.0 I believe you have another option, which is to first use convert_dtypes. This converts the dataframe columns to dtypes that support pd.NA, avoiding the issues with NaN. It preserves bool values as well, unlike some of the other approaches here.

import pandas as pd

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
df2 = pd.DataFrame({'d': [1,2], 'e': [True, False]})
df = df.convert_dtypes()
df2 = df2.convert_dtypes()
print(df.join(df2))

#   a  b  c     d      e
#0  1  1  0     1   True
#1  1  2  1     2  False
#2  1  1  2  <NA>   <NA>
#3  1  2  3  <NA>   <NA>
#4  1  1  4  <NA>   <NA>
#5  1  2  5  <NA>   <NA>
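
A quick follow-up check on the same frames shows that nothing was upcast by the join:

print(df.join(df2).dtypes)
#a      Int64
#b      Int64
#c      Int64
#d      Int64
#e    boolean
#dtype: object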
totalhack
  • 2,298
  • 17
  • 23
  • 1
    Golden answer - thank you totalhack! – jtlz2 May 08 '22 at 19:07
  • Very good answer. Should be the accepted solution when using newer pandas versions. Since pandas version 2.0 you can even choose the `dtype_backend` between `numpy_nullable` and `pyarrow`. See the updated documentation at https://pandas.pydata.org/pandas-docs/version/2.0/reference/api/pandas.Series.convert_dtypes.html – macKaiver Apr 13 '23 at 08:45
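
With pandas ≥ 2.0 and pyarrow installed, the same call can instead produce Arrow-backed nullable columns (a small sketch of the dtype_backend option mentioned in the comment above):

import pandas as pd  # assumes pandas >= 2.0 with pyarrow installed

df = pd.DataFrame({'a': [1]*6, 'b': [1, 2]*3, 'c': range(6)})
print(df.convert_dtypes(dtype_backend='pyarrow').dtypes)
#a    int64[pyarrow]
#b    int64[pyarrow]
#c    int64[pyarrow]
#dtype: object
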
3

Or you can just concat/append the dtypes of both dfs and apply astype():

joined = df.join(df2, how='outer').fillna(-1).astype(pd.concat([df.dtypes, df2.dtypes]))
#or, on pandas < 2.0 (Series.append has since been removed):
#joined = df.join(df2, how='outer').fillna(-1).astype(df.dtypes.append(df2.dtypes))
print(joined)

   a  b  c  e  d
0  1  1  0  1  1
1  1  2  1  1  2
2  1  1  2 -1 -1
3  1  2  3 -1 -1
4  1  1  4 -1 -1
5  1  2  5 -1 -1
6  1  1  6 -1 -1
7  1  2  7 -1 -1
8  1  1  8 -1 -1
9  1  2  9 -1 -1
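
A quick check confirms the original integer dtypes are back after the cast:

print(joined.dtypes)

a    int64
b    int64
c    int64
e    int64
d    int64
dtype: object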
anky
  • 74,114
  • 11
  • 41
  • 70