
This is different from [error using astype when NaN exists in a dataframe](https://stackoverflow.com/questions/41550746/error-using-astype-when-nan-exists-in-a-dataframe) because I need to keep the NaN values, so I've chosen to use the experimental IntegerArray. The crux of this issue is avoiding the looping.
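
For context, a minimal illustration of the difference, using the generic nullable 'Int64' dtype: a plain astype(int) raises on the NaN, while the nullable dtype keeps it.

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])   # float64 because of the NaN
# s.astype(int)                     # ValueError: cannot convert non-finite values to integer
s.astype('Int64')                   # nullable IntegerArray; the NaN is preserved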

We have a number of large medical datasets that I am importing into Pandas from SAS. The majority of the fields are enumerated types and should be represented as integers, but they are coming in as float64 because many contain NaN values. The experimental IntegerArray type in Pandas solves the NaN problem. However, these datasets are very large and I'd like to convert them in a script based on the data itself. The following script works, but it is extremely slow, and I have not figured out a more Pythonic or "Pandorable" way of writing it.

# Convert any non-float fields to IntegerArray (Int)
# Note that IntegerArrays are an experimental addition in Pandas 0.24. They
# allow integer columns to contain NaN fields like float columns.
#
# This is a rather brute-force technique that loops through every column
# and every row. There's got to be a more efficient way to do it since it 
# takes a long time and uses up a lot of memory.
def convert_integer(df):
    for col in df.columns:
        intcol_flag = True
        if df[col].dtype == 'float64':   # Assuming dtype is "float64"
            # TODO: Need to remove inner loop - SLOW!
            for val in df[col]:
                # If not NaN and the int() value is different from
                # the float value, then we have an actual float.
                if pd.notnull(val) and abs(val - int(val)) > 1e-6:
                    intcol_flag = False
                    break
            # If not a float, change it to an Int based on size
            if intcol_flag:
                if df[col].abs().max() < 127:
                    df[col] = df[col].astype('Int8')
                elif df[col].abs().max() < 32767:
                    df[col] = df[col].astype('Int16')
                else:   # assuming no ints greater than 2147483647 
                    df[col] = df[col].astype('Int32') 
        print(f"{col} is {df[col].dtype}")
    return df

I assumed that the inner for loop was the problem, so I tried replacing it with:

            s = df[col].apply(lambda x: pd.notnull(x) and abs(x - int(x)) > 1e-6)
            if s.any():
                intcol_flag = False

and it is still just as slow.
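
For reference, the per-column test itself can be written as one vectorized expression instead of apply. A sketch using the same 1e-6 tolerance as the code above (round() is used so NaN never has to pass through int()):

nonnull = df[col].dropna()
intcol_flag = nonnull.sub(nonnull.round()).abs().le(1e-6).all()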

Here is some sample data and the desired output:

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.NaN], (3,9)), 
                  columns=[f'col{i}' for i in range(9)])
df

       col0    col1      col2      col3    col4    col5      col6  col7  col8
0       2.0     NaN  111111.0       1.0     2.0  5000.0  111111.0   2.0   NaN
1       1.0     NaN       2.0       3.3     1.0     2.0       1.0   3.3   1.0
2  111111.0  5000.0       1.0  111111.0  5000.0     1.0    5000.0   3.3   2.0

And the result should be:

col0 is Int32
col1 is Int16
col2 is Int32
col3 is float64
col4 is Int16
col5 is Int16
col6 is Int32
col7 is float64
col8 is Int8
MikeBopf
  • Possible duplicate of [error using astype when NaN exists in a dataframe](https://stackoverflow.com/questions/41550746/error-using-astype-when-nan-exists-in-a-dataframe) – David Zemens Oct 21 '19 at 17:03
  • `apply` is essentially a for loop, so it does you no good. Use `s = df[col].notnull() & df[col].sub(df[col].astype(int)).abs().gt(1e-6)` – Quang Hoang Oct 21 '19 at 17:03
  • Do you need to preserve the NaN values? Those are of type float. See proposed duplicate Q – David Zemens Oct 21 '19 at 17:03
  • You can try timing different portions of your code to see what is taking a long time. https://pythonhow.com/measure-execution-time-python-code/ – ShayneLoyd Oct 21 '19 at 17:05
  • @DavidZemens Yes, I want to preserve the NaNs, and the experimental IntegerArray added in Pandas v 0.24 works well for this. So the conversion isn't the problem, just efficiency (and ugly looping). – MikeBopf Oct 21 '19 at 17:18
  • @QuangHoang - Thanks that works except for the "astype(int)" which gives an error because of the NaN values. Substituting round() for the astype seems to work, but it still seems slow. I'll run some timing tests. – MikeBopf Oct 21 '19 at 17:49

1 Answer


Find the columns that need to be cast to each type, then do each conversion all at once.

Sample Data

import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.NaN], (3,9)), 
                  columns=[f'col{i}' for i in range(9)])

Code

s = pd.cut(df.max(), bins=[0, 127, 32767, 2147483647],
           labels=['Int8', 'Int16', 'Int32'])
# Keep a label only for columns that were previously float and whose values
# are all NaN or integer-like; other columns become NaN and are left alone.
s = s.where((df.dtypes == 'float') & (df.isnull() | (df % 1 == 0)).all())
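
After the where step, s maps each convertible column to its target dtype and holds NaN for the columns that must stay float; with the sample data it looks like:

s
#col0    Int32
#col1    Int16
#col2    Int32
#col3      NaN
#col4    Int16
#col5    Int16
#col6    Int32
#col7      NaN
#col8     Int8
#dtype: category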

for idx, gp in s.groupby(s):
    df.loc[:, gp.index] = df.loc[:, gp.index].astype(idx)

df.dtypes
#col0      Int32
#col1      Int16
#col2      Int32
#col3    float64
#col4      Int16
#col5      Int16
#col6      Int32
#col7    float64
#col8       Int8
#dtype: object

print(df)
#     col0  col1    col2      col3  col4  col5    col6  col7  col8
#0       2   NaN  111111       1.0     2  5000  111111   2.0   NaN
#1       1   NaN       2       3.3     1     2       1   3.3     1
#2  111111  5000       1  111111.0  5000     1    5000   3.3     2
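
If it helps, here is the same idea packaged as a drop-in replacement for the question's convert_integer(). This is a sketch: the function name is mine, it assigns with plain df[col] = ... (the dtype-changing behavior of .loc assignment varies across pandas versions), and like the code above it skips columns whose max is not positive.

def convert_integer_fast(df):
    # Bin each column's max into a candidate nullable integer dtype.
    targets = pd.cut(df.max(), bins=[0, 127, 32767, 2147483647],
                     labels=['Int8', 'Int16', 'Int32'])
    # Keep a label only for float columns whose values are all NaN or
    # integer-like; everything else becomes NaN and is left unconverted.
    targets = targets.where((df.dtypes == 'float') &
                            (df.isnull() | (df % 1 == 0)).all())
    for col, dtype in targets.dropna().items():
        df[col] = df[col].astype(dtype)
    return df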
ALollz
  • I like where you're going with this, but one thing I didn't make clear is that a small number of the columns actually do contain floating point values and those columns can't be converted. If you don't mind I'll borrow your data to clear up the issue. – MikeBopf Oct 22 '19 at 00:02
  • Added data to my question and modified your code to include floating-point data fields. – MikeBopf Oct 22 '19 at 00:39
  • @MikeBopf I updated, it was a small change. I think your expected output in the question is slightly off. – ALollz Oct 22 '19 at 13:55
  • I fixed my expected output - bad cut and paste. I'm still trying to get your solution to work with my data because I have a couple special cases. I think it will work, though. – MikeBopf Oct 22 '19 at 22:13
  • I've had hardware issues so haven't been able to test it on my full dataset, but the solution works on the test data and I like the cleverness of the answer. – MikeBopf Oct 26 '19 at 14:12