
I have a pandas Series where the data is of type string.

import pandas as pd
import numpy as np

df['count']

0         4
1       nan
2       nan
3         1
4       nan
5       227.0
        ... 
635     nan
636     nan
637       1
638       8
639    None

I'd like to replace the string `nan` with `np.nan` and set the data type of the numeric values to int.

I tried this, but it didn't work. Values are still strings.

df['count'] = df['count'].apply(lambda x: np.where(x.isdigit(), x, np.nan))
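
Here is a minimal repro with made-up values (not my real data); the result still comes back as strings with object dtype:

import numpy as np
import pandas as pd

# toy series of strings, standing in for df['count']
s = pd.Series(['4', 'nan', '227.0', '1'])
out = s.apply(lambda x: np.where(x.isdigit(), x, np.nan))

print(out.dtype)   # object -- the digit strings like '4' are still strings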
kms
    `np.nan` is a float, so the column dtype will still be `object`. – hpaulj Apr 03 '22 at 16:12
  • why not replace anything that doesn't convert to numeric with `np.nan`? And keep the type as float, following @hpaulj's comment. – Pierre D Apr 03 '22 at 16:58
  • `np.where(cond, A, B)` evaluates all of `A` and `B`. It selects between the two, but it is NOT a conditional evaluator. In other words, it is not `if cond: do A else: do B`. – hpaulj Apr 04 '22 at 00:08

2 Answers


What I would do is:

df['count'] = pd.to_numeric(df['count'], errors='coerce')

After that, your column will be np.float64 and anything that could not be converted to a float will be np.nan.

A common way to convert such a column to int is to choose a value to replace `NaN`. That is application-dependent, of course, but since your column is named 'count', a value of -1 could perhaps be adequate.

Alternatively, you can use pandas' nullable integer.

Example

df = pd.DataFrame('4 nan nan 1 nan 227.0 1 8 None'.split(), columns=['count'])

>>> df
   count
0      4
1    nan
2    nan
3      1
4    nan
5  227.0
6      1
7      8
8   None

Method 1: convert to numeric, then to int with -1 to indicate "bad value":

newdf = df.assign(
    count=pd.to_numeric(df['count'], errors='coerce')
    .fillna(-1)
    .astype(int)
)

>>> newdf
   count
0      4
1     -1
2     -1
3      1
4     -1
5    227
6      1
7      8
8     -1

Method 2: convert to 'Int64' (nullable integer):

newdf = df.assign(
    count=pd.to_numeric(df['count'], errors='coerce')
    .astype('Int64')
)

>>> newdf
   count
0      4
1   <NA>
2   <NA>
3      1
4   <NA>
5    227
6      1
7      8
8   <NA>
Pierre D

You can run a list comprehension for this:

df['count'] = [int(val) if val != "nan" else np.nan for val in df['count']]

If the casting doesn't work (i.e. you have other numeric values that are not ints), you can cast to float as a catch-all and then convert to int: `int(float(val))`

df['count'] = [int(float(val)) if val != "nan" else np.nan for val in df['count']]

Alternatively, if you wish to be more pandas-idiomatic, you can use one of pandas' built-in conversion approaches (specifically, `to_numeric()`).
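
For instance, a minimal sketch of that route (non-numeric strings, including 'nan' and None, become NaN):

df['count'] = pd.to_numeric(df['count'], errors='coerce')   # unparseable values -> NaN, dtype float64

# or, to keep an integer dtype while still allowing missing values:
df['count'] = pd.to_numeric(df['count'], errors='coerce').astype('Int64')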

Kevin M
    This is actually a list comprehension, not just a for loop. – Apr 03 '22 at 16:42
  • @KevinM Found a case where this fails: ValueError: invalid literal for int() with base 10: '227.0'. I modified the question. – kms Apr 03 '22 at 16:45
  • Changed response accordingly. I still recommend using a list comprehension, as it's stricter, but Pandas' `to_numeric` is also an option. – Kevin M Apr 03 '22 at 17:01
  • For example, your value of `None` will cause the list comprehension to error, but not `to_numeric()`/Pierre D's response. Your *question* is asking about *nans*. Is `None` supposed to be treated like nan, or is it an inherent issue with your dataset? You could potentially be ignoring problems in your dataset by using `to_numeric()` blindly when the latter is the case. – Kevin M Apr 03 '22 at 17:33
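
As a rough illustration of that last point, using the string 'None' from the example DataFrame above: the list comprehension raises, while `to_numeric(..., errors='coerce')` silently turns it into NaN.

import pandas as pd

s = pd.Series(['4', 'nan', 'None'])

try:
    [int(float(v)) if v != 'nan' else float('nan') for v in s]
except ValueError as err:
    print('list comprehension fails:', err)          # could not convert string to float: 'None'

print(pd.to_numeric(s, errors='coerce').tolist())    # [4.0, nan, nan]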