17

I'm working with individual rows of pandas data frames, but I'm stumbling over coercion issues while indexing and inserting rows. Pandas seems to always want to coerce from a mixed int/float to all-float types, and I can't see any obvious controls on this behaviour.

For example, here is a simple data frame with `a` as int and `b` as float:

import pandas as pd
pd.__version__  # '0.25.2'

df = pd.DataFrame({'a': [1], 'b': [2.2]})
print(df)
#    a    b
# 0  1  2.2
print(df.dtypes)
# a      int64
# b    float64
# dtype: object

Here is a coercion issue while indexing one row:

print(df.loc[0])
# a    1.0
# b    2.2
# Name: 0, dtype: float64
print(dict(df.loc[0]))
# {'a': 1.0, 'b': 2.2}

And here is a coercion issue while inserting one row:

df.loc[1] = {'a': 5, 'b': 4.4}
print(df)
#      a    b
# 0  1.0  2.2
# 1  5.0  4.4
print(df.dtypes)
# a    float64
# b    float64
# dtype: object

In both instances, I want the `a` column to remain as an integer type, rather than being coerced to a float type.

Mike T
  • I found [this](https://github.com/pandas-dev/pandas/issues/11617), but I could not find whether the issue was ever actually resolved. In the meantime I guess you could do: `df.loc[[0], df.columns]` – Dani Mesejo Oct 23 '19 at 23:51
  • Duplicates? [.loc indexing changes type](https://stackoverflow.com/q/43366763/7851470) & [Adding row to pandas DataFrame changes dtype](https://stackoverflow.com/q/22044766/7851470). – Georgy Nov 07 '19 at 13:58
  • Sounds like pd.DataFrame doesn't support type mixing on instantiation? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html dtype param only supports a single type. `.read_[type]` supports multiple dtypes though... – Quentin Nov 08 '19 at 21:47

5 Answers

4

After some digging, here are some terribly ugly workarounds. (A better answer will be accepted.)

A quirk found here is that non-numeric columns stop coercion, so here is how to index one row to a dict:

dict(df.assign(_='').loc[0].drop('_', axis=0))
# {'a': 1, 'b': 2.2}
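Another possible sketch (not from the original answer, so treat it as an assumption about what is acceptable): build the dict one scalar at a time with `.at`, which returns each cell with its own column dtype and never creates a mixed-dtype Series:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# .at fetches a single scalar per cell, so no row-wise dtype coercion can happen
row = {col: df.at[0, col] for col in df.columns}
print(row)  # 'a' keeps its integer type, 'b' stays float
```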

And inserting a row can be done by creating a new data frame with one row:

df = df.append(pd.DataFrame({'a': 5, 'b': 4.4}, index=[1]))
print(df)
#    a    b
# 0  1  2.2
# 1  5  4.4

Neither of these tricks is optimised for large data frames, so I would greatly appreciate a better answer!

Mike T
  • You could always just coerce post append `df['a'] = df.a.astype(mytype)`... It's still dirty though and probably not efficient. – Quentin Nov 08 '19 at 21:50
  • `.astype()` is dangerous for float -> integer; it has no problem changing `1.1` to `1`, so you really need to be sure all of your values are 'integer-like' before doing it. Probably best to use `pd.to_numeric` with `downcast='integer'` – ALollz Nov 11 '19 at 16:26
3

Whenever you are getting data from a dataframe, or appending data to one, and need to keep the data types the same, avoid conversion to other internal structures that are not aware of the needed data types.

When you do df.loc[0], the row is converted to a pd.Series:

>>> type(df.loc[0])
<class 'pandas.core.series.Series'>

And a Series can only have a single dtype, thus coercing the int to float.

Instead, keep the structure as a pd.DataFrame:

>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>

Select the needed row as a frame and then convert it to a dict:

>>> df.loc[[0]].to_dict(orient='records')
[{'a': 1, 'b': 2.2}]
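If a plain dict for the single row is what's wanted, the records list can simply be indexed (a small sketch along the same lines):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# One-row frame -> list containing one dict -> take the first element
row = df.loc[[0]].to_dict(orient='records')[0]
print(row)
```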

Similarly, to add a new row, use the pd.DataFrame.append function:

>>> df = df.append([{'a': 5, 'b': 4.4}]) # NOTE: To append as a row, use []
   a    b
0  1  2.2
0  5  4.4

The above will not cause type conversion:

>>> df.dtypes
a      int64
b    float64
dtype: object
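Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same dtype-preserving append can be sketched with `pd.concat` by wrapping the new row in a one-row frame (an assumption about equivalent behaviour, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# A one-row DataFrame keeps per-column dtypes, unlike a dict -> Series round-trip
new_row = pd.DataFrame({'a': [5], 'b': [4.4]})
df = pd.concat([df, new_row], ignore_index=True)

print(df.dtypes)  # 'a' stays int64
```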
Vishnudev Krishnadas
  • Wow had to read that second code block three times to get it. That is very subtle. This is much better than what I've done in the past... loop through the final dataframe and reassign the values with the correct data type (yes what I did is a horrible solution that really won't scale.). – VanBantam Nov 11 '19 at 18:28
2

The root of the problem is that

  1. Indexing a pandas dataframe returns a pandas Series

We can see that:

type(df.loc[0])
# pandas.core.series.Series

And a series can only have one dtype, in your case either int64 or float64.

Two workarounds come to mind:

print(df.loc[[0]])
# this will return a dataframe instead of series
# so the result will be
#    a    b
# 0  1  2.2

# but the dictionary is hard to read
print(dict(df.loc[[0]]))
# {'a': 0    1
# Name: a, dtype: int64, 'b': 0    2.2
# Name: b, dtype: float64}

or

print(df.astype(object).loc[0])
# this will change the type of value to object first and then print
# so the result will be
# a      1
# b    2.2
# Name: 0, dtype: object

print(dict(df.astype(object).loc[0]))
# in this way the dictionary is as expected
# {'a': 1, 'b': 2.2}
  2. When you append a dictionary to a dataframe, the dictionary is converted to a Series first and then appended, so the same problem happens again.

https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6973

if isinstance(other, dict):
    other = Series(other)

So your workaround is actually a solid one; alternatively, we could:

df.append(pd.Series({'a': 5, 'b': 4.4}, dtype=object, name=1))
#    a    b
# 0  1  2.2
# 1  5  4.4
Hongpei
  • Good idea to use `object` data types! Another one is to create an object DataFrame from the beginning: `df = pd.DataFrame({'a': [1], 'b': [2.2]}, dtype=object)` – Mike T Nov 12 '19 at 00:21
1

A different approach with slight data manipulations:

Assume you have a list of dictionaries (or dataframes)

lod=[{'a': [1], 'b': [2.2]}, {'a': [5], 'b': [4.4]}]

where each dictionary represents a row (note that the values are wrapped in lists). Then you can create a dataframe easily via:

pd.concat([pd.DataFrame(dct) for dct in lod])
   a    b
0  1  2.2
0  5  4.4

and you maintain the types of the columns (see `pd.concat`).

So if you have a dataframe and a list of dicts, you could just use

pd.concat([df] + [pd.DataFrame(dct) for dct in lod])
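Put together as a runnable sketch (with `ignore_index=True` added here to avoid the duplicated 0 index shown above):

```python
import pandas as pd

lod = [{'a': [1], 'b': [2.2]}, {'a': [5], 'b': [4.4]}]

# Each dict becomes a typed one-row DataFrame, so concat preserves column dtypes
df = pd.concat([pd.DataFrame(dct) for dct in lod], ignore_index=True)

print(df.dtypes)  # 'a' remains int64
```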
Quickbeam2k1
0

In the first case, you can work with the nullable integer data type. The Series selection doesn't coerce to float and values are placed in an object container. The dictionary is then properly created, with the underlying value stored as a np.int64.

df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')

d = dict(df.loc[0])
#{'a': 1, 'b': 2.2}

type(d['a'])
#numpy.int64

With your syntax, this almost works for the second case too, but this upcasts to object, so not great:

df.loc[1] = {'a': 5, 'b': 4.4}
#   a    b
#0  1  2.2
#1  5  4.4

df.dtypes
#a     object
#b    float64
#dtype: object

However, we can make a small change to the syntax for adding a row at the end (with a RangeIndex) and now types are dealt with properly.

df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')

df.loc[df.shape[0], :] = [5, 4.4]
#   a    b
#0  1  2.2
#1  5  4.4

df.dtypes
#a      Int64
#b    float64
#dtype: object
ALollz