363

I read data from a .csv file into a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?

denis
Zhubarb
  • Could you post the content of your file? – Alvaro Fuentes Jan 22 '14 at 15:56
  • @xndrme, the file itself is too large. I will see if I can create a small test case. But essentially the situation is that the `id` column has many integer values and some empty/missing cells. – Zhubarb Jan 22 '14 at 16:00
  • 5
I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. This I think is to do with numpy compatibility (I'm guessing here); if you want missing-value compatibility, then I would store the values as floats – EdChum Jan 22 '14 at 16:14
  • 1
see here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions; you must have a float dtype when you have missing values (or technically object dtype, but that is inefficient); what is your goal of using int type? – Jeff Jan 22 '14 at 16:16
  • FYI, if you don't specify a dtype, then pandas will infer float for the column, no conversion needed. – Jeff Jan 22 '14 at 16:26
  • 8
    I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats. – ely Jan 22 '14 at 17:44
  • 1
    I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. Just makes things slightly more complicated, would be nice if there was simple work-around. – dermen Jul 11 '15 at 03:52
  • 2
@Zhubarb, Optional Nullable Integer Support is now officially added on pandas 0.24.0 - finally :) - please find an updated answer below. [pandas 0.24.x release notes](https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support) – mork Jan 25 '19 at 17:14

29 Answers

334

In version 0.24+, pandas gained the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

import numpy as np
import pandas as pd

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

To convert an existing column to nullable integers, use:

df['myCol'] = df['myCol'].astype('Int64')
jezrael
260

The lack of a NaN representation in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.
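For example, a minimal sketch using the question's file (pandas infers float64 for a column containing NaN, so no conversion is needed):

df = pd.read_csv("data.csv")   # no dtype given: the NaN-containing 'id' column is inferred as float64
df['id'].dtype                 # dtype('float64')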

agsimmons
Andy Hayden
  • 30
    Are there any other workarounds besides treating them like floats? – NumenorForLife May 14 '15 at 23:26
  • 5
    @jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well. – Andy Hayden May 19 '15 at 15:16
  • 1
    Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype. – MikeyE Aug 15 '16 at 03:23
  • 66
    In v0.24, you can now do `df = df.astype(pd.Int32Dtype())` (to convert the entire dataFrame, or) `df['col'] = df['col'].astype(pd.Int32Dtype())`. Other accepted nullable integer types are `pd.Int16Dtype` and `pd.Int64Dtype`. Pick your poison. – cs95 Apr 02 '19 at 07:56
  • 2
It is a NaN value, but isnan checking doesn't work on it at all :( – Winston Jul 31 '19 at 09:48
  • See https://stackoverflow.com/questions/58029359/pandas-convert-column-to-int-and-coerce-nan – PatrickT Oct 25 '21 at 05:36
@cs95 I am getting the error `object cannot be converted to an IntegerDtype` – Henrique Brisola Dec 06 '21 at 16:59
76

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove the NaNs, convert to int, convert to str, and then reinsert the NaNs.

It's not pretty but it gets the job done!

hibernado
  • 2
    I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me. – Chris Decker Jan 15 '19 at 17:51
  • 3
    The OP wants a column of integers. Converting it to string does not meet the condition. – Rishab Gupta Feb 21 '19 at 01:33
  • 5
    Works only if col doesn't already have -1. Otherwise, it will mess with the data – Sharvari Gc Oct 10 '19 at 04:55
  • 1
    then how to get back to int..?? – abdoulsn Jan 23 '20 at 09:48
  • This produces a column of strings!! For a solution with current versions of `pandas`, see https://stackoverflow.com/questions/58029359/pandas-convert-column-to-int-and-coerce-nan – PatrickT Oct 25 '21 at 05:39
  • Use case here for this answer is trying to load to a DB - so potentially writing to csv then bulk insert - in this case, forcing the int to string then writing it stops SQL complaining that e.g. 10.0 is not an int and can't be loaded. But not a solution for all cases. – tim654321 Jun 27 '23 at 23:59
13

Whether your pandas series is object dtype or simply float dtype, the method below will work:

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')
Abhishek Bhatia
12

It is now possible to create a pandas column containing NaNs as dtype int, since this is officially supported as of pandas 0.24.0.

The pandas 0.24.x release notes quote: "Pandas has gained the ability to hold integer dtypes with missing values."
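A minimal sketch (the values here are made up for illustration):

s = pd.Series([1, 2, np.nan], dtype='Int64')
s.dtype   # Int64 -- the missing entry stays missing instead of forcing the column to float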

mork
7

I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.

for col in discrete:
    df[col] = pd.to_numeric(df[col],errors='coerce').astype(pd.Int64Dtype())
Kamil
6

If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.

jmenglund
6

As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.

When reading in your data all you have to do is:

df= pd.read_csv("data.csv", dtype={'id': 'Int64'})  

Notice that 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from numpy's int64.
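For instance, a quick sketch of the contrast (the second line is commented out because it raises):

s = pd.Series([1, None], dtype='Int64')   # nullable extension dtype: the None becomes a missing value
# pd.Series([1, None], dtype='int64')     # plain numpy dtype: raises, int64 cannot hold a missing value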

As a side note, this will also work with .astype()

df['id'] = df['id'].astype('Int64')

Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

Bradon
5

You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

For illustration, here is an example of how floats may lose precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890
elomage
3

If you can modify your stored data, use a sentinel value for the missing id. In the common use case, suggested by the column name, where id is an integer strictly greater than zero, you could use 0 as the sentinel value so that you can write

if row['id']:
    regular_process(row)
else:
    special_process(row)
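A minimal sketch of writing the sentinel back into the frame (this assumes 0 never occurs as a real id):

df['id'] = df['id'].fillna(0).astype(int)   # 0 marks a missing id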
gboffi
3

Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain the integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object column that will look like an integer field with null values when written to a CSV.

keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
Corbin
3

The issue with Int64, like many of the other solutions, is that if you have null values, they get replaced with <NA> values, which do not work with pandas' default NaN functions such as isnull() or fillna(). And if you convert values to -1, you end up in a situation where you may be deleting your information. My solution is a little lame, but it will provide int values alongside np.nan, allowing the NaN functions to work without compromising your values.

def to_int(x):
    try:
        return int(x)
    except:
        return np.nan

df[column] = df[column].apply(to_int)
WolVes
2
import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
  • 4
    Is there a reason you prefer this formulation over that proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation—and especially since there are ten _additional_ answers that are competing for attention. – Jeremy Caney Jun 06 '20 at 00:38
  • 1
    While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform differentiates it from forums. You can `edit` to add additional info &/or to supplement your explanations with source documentation. – SherylHohman Jun 06 '20 at 01:35
2

If you want to use it when you chain methods, you can use assign:

df = (
     df.assign(col=lambda x: x['col'].astype('Int64'))
)
Mehdi Golzadeh
2

For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas version 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:

df = df.where(pd.notnull(df), None)

This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.

TWebbs
2

First you need to specify the newer integer type, Int8 (... Int64), which can handle null integer data (pandas version >= 0.24.0):

df = df.astype('Int8')

But you may want to only target specific columns which have integer data mixed with NaN/nulls:

df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})

At this point, the NaNs are converted into <NA>, and if you want to change the default null value with df.fillna(), you need to coerce the object datatype on the columns you wish to change, otherwise you will see TypeError: <U1 cannot be converted to an IntegerDtype

You can do this by df = df.astype(object) if you don't mind changing every column datatype to object (individually, each value's type is still preserved) ... OR df = df.astype({"col1": object,"col2": object}) if you prefer to target individual columns.

This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
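A rough sketch of that sequence on a single column (col1 from above; the 'missing' fill value is just an example):

df['col1'] = df['col1'].astype('Int8')      # NaNs become <NA>
df['col1'] = df['col1'].astype(object)      # relax to object so fillna can insert any value
df['col1'] = df['col1'].fillna('missing')   # replace the nulls with whatever you like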

1

I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs
        df = pd.read_csv(file_path, **kwargs)
        for col, typ in custom_dtype.items():
            if fill_values is None or col not in fill_values:
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
    return df
Neuneck
1

Try this:

df[['id']] = df[['id']].astype(pd.Int64Dtype())

If you print its dtypes, you will get id Int64 instead of the normal int64.

Nikhil Redij
1

Use .fillna() to replace all NaN values with 0, then convert the column to int using .astype(int):

df['id'] = df['id'].fillna(0).astype(int)
Alex Metsai
0

First remove the rows that contain NaN, then do the integer conversion on the remaining rows, and at last insert the removed rows again. A rough sketch of this approach follows. Hope it will work.
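A minimal sketch (assuming 'id' is the column in question; note that concatenating the NaN rows back upcasts the column again unless a nullable dtype such as 'Int64' is used):

mask = df['id'].notna()
converted = df.loc[mask].astype({'id': int})
df = pd.concat([converted, df.loc[~mask]]).sort_index()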

kamran kausar
0

Had a similar problem. That was my solution:

def toint(zahl = 1.1):
    try:
        zahl = int(zahl)
    except:
        zahl = np.nan
    return zahl

print(toint(4.776655), toint(np.nan), toint('test'))

4 nan nan

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
mqx
0

I think the approach of @Digestible1010101 is the more appropriate one for pandas 1.2+ versions; something like this should do the job:

df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
Nimantha
David I. Rock
0

Since I didn't see the answer here, I might as well add it:

A quick way to convert NaNs to empty strings if for some reason you still can't handle np.nan or pd.NA (like me, when relying on a library pinned to an older version of pandas):

num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].fillna(-1).astype(int).astype(str).replace('-1', '')

lassebenninga
0

Similar to @hibernado's answer, but keeping the values numeric instead of strings (note that np.where reintroduces np.nan, so the final column dtype is float):

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col])
0
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')

okadahiroshi
0

df['id'] = df['id'].astype('float').astype(pd.Int64Dtype())

0

I use the following workaround:

condition = (~df['mixed_column'].isnull())
df['mixed_column'] = df['mixed_column'].mask(condition, df[condition]['mixed_column'].astype(int))
Yashar Ahmadov
-2

Assume your DateColumn, formatted as 3312018.0, should be converted to 03/31/2018 as a string, and some records are missing or 0.

df['DateColumn'] = df['DateColumn'].fillna(0).astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
Justin Malinchak
-2

use pd.to_numeric()

df["DateColumn"] = pd.to_numeric(df["DateColumn"])

simple and clean

KeepLearning
  • 4
    If there are NaN values in the column, pd.to_numeric will convert the dtype to float not int because NaN is considered a float. – Bradon Sep 20 '20 at 15:50