
I have a dataset where pandas.read_csv() correctly cast some continuous numeric columns (features/variables) from object to float64 (or int64 or uint8), but not others.

So I then tried to convert the column that should have been cast to a numeric type, specifically int64, using the following pandas.to_numeric() call with the downcast parameter specified, yet I still get a float64 result.

df.wc = pd.to_numeric(df.wc, errors='coerce', downcast='signed') 
# call to convert object to int64 vs float64 

Is there a typical data issue in a column that causes the downcast setting to be ignored when converting an object column to the most specific numeric type?

myusrn
  • What happens when you try `errors='raise'`? According to the [docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html), downcast only applies under the following condition: `If not None, and if the data has been successfully cast to a numerical dtype`. So `errors='coerce'` may be hiding something. On a side note, do you have any legitimate float numbers in your dataset (e.g. 1.0)? – MattR Jan 30 '18 at 20:49
  • Can you show a small sample dataframe with the values that are not being converted as you expect? – JohnE Jan 30 '18 at 20:50
  • Do your `int` columns contain `NaN` values? Then these columns can't be converted from float -> int. I asked a similar question today: https://stackoverflow.com/q/48518735/8881141 – Mr. T Jan 30 '18 at 20:51
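Following up on the comments above, here is a minimal sketch of one way to locate the values that fail conversion. The sample series is hypothetical and only stands in for the `wc` column; the idea is to compare what was missing before coercion with what is missing afterwards.

```python
import pandas as pd

# Hypothetical sample data standing in for the `wc` column
s = pd.Series(["1000", "1150", "n/a", "2250", None], name="wc")

# Non-numeric strings become NaN instead of raising
converted = pd.to_numeric(s, errors="coerce")

# Values that were present originally but turned into NaN during coercion
bad_mask = converted.isna() & s.notna()
offending = s[bad_mask]

print(offending)        # the raw values that could not be parsed
print(converted.dtype)  # float64, because the column now contains NaN
```

Inspecting `offending` (or `offending.unique()` on a large column) should show which raw strings are responsible for the float64 result.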

2 Answers


According to the documentation:

... downcast that resulting data to the smallest numerical dtype possible according ...

In my experiments, float values that are whole numbers can be downcast to an integer dtype:

pd.to_numeric(pd.Series([1.0, 2.0]), downcast='unsigned')
0    1
1    2
dtype: uint8

However, values with a fractional part cannot be downcast to an integer dtype, so the result stays float64:

pd.to_numeric(pd.Series([1.1, 2.1]), downcast='unsigned')
0    1.1
1    2.1
dtype: float64
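The same thing appears to happen when coercion introduces NaN, which is the case discussed in the comments. A small sketch with made-up string data (not the asker's dataset) comparing a clean column with one containing an unparseable value:

```python
import pandas as pd

# All values parse as integers, so the downcast can take effect
clean = pd.to_numeric(pd.Series(["1", "2", "3"]), downcast="signed")

# "x" is coerced to NaN; NaN has no integer representation,
# so the column stays float64 even though downcast is set
with_nan = pd.to_numeric(
    pd.Series(["1", "2", "x"]), errors="coerce", downcast="signed"
)

print(clean.dtype)     # int8
print(with_nan.dtype)  # float64
```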

If you want int64 values in the result, you can apply pd.Series.astype (note that this truncates the fractional part rather than rounding):

pd.Series([1.1, 2.1]).astype(int)
0    1
1    2
dtype: int64


mr.tarsa
  • Thanks for the additional details on interpreting the to_numeric documentation, and for your results. In my case the dataset does not contain decimals; e.g. pd.to_numeric(pd.Series([1000, 1150, 2250]), downcast='signed') and downcast='unsigned' generate the expected int16 and uint16 with that example series, but not with my actual dataset. So it seems I'll have to comb through it more closely to determine what is causing the float64 result in spite of the downcast='signed' / 'unsigned' setting. – myusrn Jan 30 '18 at 21:19
  • I tried df.columnWithInts.astype(int) and it raises "ValueError: cannot convert float NaN to integer", so there must be some observation in that column that doesn't look like an integer. – myusrn Jan 30 '18 at 21:22
  • @myusrn Yes, you need to handle `NaN` values before converting to ints. Look at [`pd.Series.fillna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html) or [`pd.Series.dropna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.dropna.html). But the best thing you can do is provide a sample of the data you are struggling with. – mr.tarsa Jan 30 '18 at 21:27
  • Thanks for the additional insights and suggestion. I added a df.dropna(axis=0, how='any', inplace=True) call before converting the object/str columns to the relevant integer and float types, and I now get the expected int64 and float64 results without the pd.to_numeric downcast parameter involved. My confusion was that I was under the impression that rows where the column in question had a NaN entry would not affect the logic that determines which type pd.to_numeric() converts the entire column to. – myusrn Jan 31 '18 at 03:26

When using

pandas.to_numeric(df[some_column], errors='coerce', downcast='integer')

any value in some_column that cannot be converted is coerced to NaN, and because NaN cannot be stored in a NumPy integer dtype, the whole column stays float64 and is not downcast.

One workaround is to separate the removal of non-numeric values from the downcast to integer:

df[some_column]=pd.to_numeric(df[some_column], errors='coerce')
df.dropna(subset = [some_column], inplace = True)
df[some_column]=pd.to_numeric(df[some_column], downcast='integer')

The first line sets non-numeric values to NaN. The second line drops those rows in place. The third line downcasts the remaining values to an integer dtype.
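For reference, here is a self-contained sketch of those three steps on a hypothetical DataFrame (the column name `wc` is borrowed from the question; the data is invented). As an alternative that is not part of the steps above, it also shows pandas' nullable `Int64` dtype, which, assuming pandas 0.24 or later, keeps the rows and stores the missing entries as missing values instead of dropping them.

```python
import pandas as pd

# Hypothetical data: one entry cannot be parsed as a number
df = pd.DataFrame({"wc": ["1000", "1150", "oops", "2250"]})

# Option A: coerce, drop the rows that became NaN, then downcast
cleaned = df.copy()
cleaned["wc"] = pd.to_numeric(cleaned["wc"], errors="coerce")
cleaned = cleaned.dropna(subset=["wc"]).copy()
cleaned["wc"] = pd.to_numeric(cleaned["wc"], downcast="integer")
print(cleaned["wc"].dtype)  # a signed integer dtype

# Option B (assumes pandas >= 0.24): keep every row and use the
# nullable integer dtype, which can hold missing values
nullable = pd.to_numeric(df["wc"], errors="coerce").astype("Int64")
print(nullable.dtype)  # Int64
```

Option A changes the number of rows, so it only fits when the unparseable observations can be discarded; Option B preserves the row count at the cost of using an extension dtype.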

Ken