363

I read data from a .csv file into a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?

denis
Zhubarb
  • Could you post the content of your file? – Alvaro Fuentes Jan 22 '14 at 15:56
  • @xndrme, the file itself is too large. I will see if I can create a small test case. But essentially the situation is that the `id` column has many integer values and some empty/missing cells. – Zhubarb Jan 22 '14 at 16:00
  • 5
I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. This I think is to do with numpy compatibility (I'm guessing here); if you want missing-value compatibility, then I would store the values as floats – EdChum Jan 22 '14 at 16:14
  • 1
see here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions; you must have a float dtype when you have missing values (or technically object dtype, but that is inefficient); what is your goal of using int type? – Jeff Jan 22 '14 at 16:16
  • FYI, if you don't specify a dtype, then pandas will infer float for the column, no conversion needed. – Jeff Jan 22 '14 at 16:26
  • 8
    I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats. – ely Jan 22 '14 at 17:44
  • 1
    I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. Just makes things slightly more complicated, would be nice if there was simple work-around. – dermen Jul 11 '15 at 03:52
  • 2
@Zhubarb, Optional Nullable Integer Support is now officially added on pandas 0.24.0 - finally :) - please find an updated answer below. [pandas 0.24.x release notes](https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support) – mork Jan 25 '19 at 17:14

29 Answers

334

In version 0.24+, pandas gained the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

import numpy as np
import pandas as pd

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

To convert an existing column to nullable integers, use:

df['myCol'] = df['myCol'].astype('Int64')
jezrael
260

The lack of a NaN representation in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.
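For example, a minimal sketch using the question's file (pandas infers float64 for a column containing NaN, so no conversion is needed):

df = pd.read_csv("data.csv")   # no dtype given: the NaN-containing 'id' column is inferred as float64
df['id'].dtype                 # dtype('float64')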

agsimmons
Andy Hayden
  • 30
    Are there any other workarounds besides treating them like floats? – NumenorForLife May 14 '15 at 23:26
  • 5
    @jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well. – Andy Hayden May 19 '15 at 15:16
  • 1
    Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype. – MikeyE Aug 15 '16 at 03:23
  • 66
    In v0.24, you can now do `df = df.astype(pd.Int32Dtype())` (to convert the entire dataFrame, or) `df['col'] = df['col'].astype(pd.Int32Dtype())`. Other accepted nullable integer types are `pd.Int16Dtype` and `pd.Int64Dtype`. Pick your poison. – cs95 Apr 02 '19 at 07:56
  • 2
It is a NaN value, but isnan checking doesn't work on it at all :( – Winston Jul 31 '19 at 09:48
  • See https://stackoverflow.com/questions/58029359/pandas-convert-column-to-int-and-coerce-nan – PatrickT Oct 25 '21 at 05:36
@cs95 I am getting the error `object cannot be converted to an IntegerDtype` – Henrique Brisola Dec 06 '21 at 16:59
76

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove the NaNs, convert to int, convert to str, and then reinsert the NaNs.

It's not pretty but it gets the job done!

hibernado
  • 2
    I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me. – Chris Decker Jan 15 '19 at 17:51
  • 3
    The OP wants a column of integers. Converting it to string does not meet the condition. – Rishab Gupta Feb 21 '19 at 01:33
  • 5
    Works only if col doesn't already have -1. Otherwise, it will mess with the data – Sharvari Gc Oct 10 '19 at 04:55
  • 1
    then how to get back to int..?? – abdoulsn Jan 23 '20 at 09:48
  • This produces a column of strings!! For a solution with current versions of `pandas`, see https://stackoverflow.com/questions/58029359/pandas-convert-column-to-int-and-coerce-nan – PatrickT Oct 25 '21 at 05:39
  • Use case here for this answer is trying to load to a DB - so potentially writing to csv then bulk insert - in this case, forcing the int to string then writing it stops SQL complaining that e.g. 10.0 is not an int and can't be loaded. But not a solution for all cases. – tim654321 Jun 27 '23 at 23:59
13

Whether your pandas series is object dtype or simply float dtype, the method below will work:

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')
Abhishek Bhatia
12

It is now possible to create a pandas column containing NaNs as dtype int, since this is officially supported as of pandas 0.24.0.

The pandas 0.24.x release notes quote: "Pandas has gained the ability to hold integer dtypes with missing values."
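A minimal sketch (the values here are made up for illustration):

s = pd.Series([1, 2, np.nan], dtype='Int64')
s.dtype   # Int64 -- the missing entry stays missing instead of forcing the column to float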

mork
7

I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.

for col in discrete:
    df[col] = pd.to_numeric(df[col],errors='coerce').astype(pd.Int64Dtype())
Kamil
6

If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.

jmenglund
6

As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.

When reading in your data all you have to do is:

df= pd.read_csv("data.csv", dtype={'id': 'Int64'})  

Notice that 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from numpy's int64.
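For instance, a quick sketch of the contrast (the second line is commented out because it raises):

s = pd.Series([1, None], dtype='Int64')   # nullable extension dtype: the None becomes a missing value
# pd.Series([1, None], dtype='int64')     # plain numpy dtype: raises, int64 cannot hold a missing value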

As a side note, this will also work with .astype()

df['id'] = df['id'].astype('Int64')

Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

Bradon
5

You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

For illustration, here is an example of how floats may lose precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890
elomage
3

If you can modify your stored data, use a sentinel value for the missing id. In the common use case, suggested by the column name, where id is an integer strictly greater than zero, you could use 0 as the sentinel value so that you can write

if row['id']:
    regular_process(row)
else:
    special_process(row)
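A minimal sketch of writing the sentinel back into the frame (this assumes 0 never occurs as a real id):

df['id'] = df['id'].fillna(0).astype(int)   # 0 marks a missing id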
gboffi
3

Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain the integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object column that will look like an integer field with null values when written to a CSV.

keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
Corbin
3

The issue with Int64, like many of the other solutions, is that if you have null values, they get replaced with <NA> values, which do not work with pandas' default NaN functions such as isnull() or fillna(). And if you convert values to -1, you end up in a situation where you may be deleting your information. My solution is a little lame, but it will provide int values alongside np.nan, allowing the NaN functions to work without compromising your values.

def to_int(x):
    try:
        return int(x)
    except:
        return np.nan

df[column] = df[column].apply(to_int)
WolVes
2
import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
  • 4
    Is there a reason you prefer this formulation over that proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation—and especially since there are ten _additional_ answers that are competing for attention. – Jeremy Caney Jun 06 '20 at 00:38
  • 1
    While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform differentiates it from forums. You can `edit` to add additional info &/or to supplement your explanations with source documentation. – SherylHohman Jun 06 '20 at 01:35
2

If you want to use it when you chain methods, you can use assign:

df = (
     df.assign(col=lambda x: x['col'].astype('Int64'))
)
Mehdi Golzadeh
2

For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas version 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:

df = df.where(pd.notnull(df), None)

This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.

TWebbs
2

First you need to specify the newer integer type, Int8 (... Int64), which can handle null integer data (pandas version >= 0.24.0):

df = df.astype('Int8')

But you may want to only target specific columns which have integer data mixed with NaN/nulls:

df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})

At this point, the NaNs are converted into <NA>, and if you want to change the default null value with df.fillna(), you need to coerce the object datatype on the columns you wish to change, otherwise you will see TypeError: <U1 cannot be converted to an IntegerDtype

You can do this by df = df.astype(object) if you don't mind changing every column datatype to object (individually, each value's type is still preserved) ... OR df = df.astype({"col1": object,"col2": object}) if you prefer to target individual columns.

This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
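A rough sketch of that sequence on a single column (col1 from above; the 'missing' fill value is just an example):

df['col1'] = df['col1'].astype('Int8')      # NaNs become <NA>
df['col1'] = df['col1'].astype(object)      # relax to object so fillna can insert any value
df['col1'] = df['col1'].fillna('missing')   # replace the nulls with whatever you like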

1

I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs
        df = pd.read_csv(file_path, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values:
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
    return df
Neuneck
1

Try this:

df[['id']] = df[['id']].astype(pd.Int64Dtype())

If you print its dtypes, you will get id Int64 instead of the normal int64.

Nikhil Redij
1

Use .fillna() to replace all NaN values with 0, then convert the column to int using .astype(int):

df['id'] = df['id'].fillna(0).astype(int)
Alex Metsai
0

First remove the rows that contain NaN, then do the integer conversion on the remaining rows, and at last insert the removed rows again. A rough sketch of this approach follows. Hope it will work.
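A minimal sketch (assuming 'id' is the column in question; note that concatenating the NaN rows back upcasts the column again unless a nullable dtype such as 'Int64' is used):

mask = df['id'].notna()
converted = df.loc[mask].astype({'id': int})
df = pd.concat([converted, df.loc[~mask]]).sort_index()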

kamran kausar
0

Had a similar problem. That was my solution:

def toint(zahl = 1.1):
    try:
        zahl = int(zahl)
    except:
        zahl = np.nan
    return zahl

print(toint(4.776655), toint(np.nan), toint('test'))

4 nan nan

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
mqx
0

I think the approach of @Digestible1010101 is the more appropriate one for pandas 1.2+ versions; something like this should do the job:

df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
Nimantha
David I. Rock
0

Since I didn't see the answer here, I might as well add it:

A quick way to convert NaNs to empty strings if for some reason you still can't handle np.nan or pd.NA (like me, when relying on a library pinned to an older version of pandas):

num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(-1).astype(int).astype(str).replace('-1', '')

lassebenninga
0

Similar to @hibernado's answer, but keeping the values numeric instead of strings (note that np.where reintroduces np.nan, so the final column dtype is float):

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col])
0
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')

okadahiroshi
0

df['id'] = df['id'].astype('float').astype(pd.Int64Dtype())

0

I use the following workaround:

condition = (~df['mixed_column'].isnull())
df['mixed_column'] = df['mixed_column'].mask(condition, df[condition]['mixed_column'].astype(int))
Yashar Ahmadov
-2

Assume your DateColumn, formatted as 3312018.0, should be converted to 03/31/2018 as a string, and some records are missing or 0.

df['DateColumn'] = df['DateColumn'].fillna(0).astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
Justin Malinchak
-2

use pd.to_numeric()

df["DateColumn"] = pd.to_numeric(df["DateColumn"])

simple and clean

KeepLearning
  • 4
    If there are NaN values in the column, pd.to_numeric will convert the dtype to float not int because NaN is considered a float. – Bradon Sep 20 '20 at 15:50