This is different from the error you get when using astype while NaN values exist in a DataFrame, because I need to keep the NaN values; that is why I've chosen to use the experimental IntegerArray. The crux of this issue is trying to avoid the looping.
We have a number of large medical datasets that I am importing into Pandas from SAS. The majority of the fields are enumerated types and should be represented as integers, but they come in as float64 because many contain NaN values. The experimental IntegerArray type in Pandas solves the NaN problem. However, these datasets are very large, and I'd like to convert them in a script based on the data itself. The following script works, but is extremely slow, and I have not figured out a more Pythonic or "Pandorable" way of writing it.
# Convert any non-float fields to IntegerArray (Int)
# Note that IntegerArrays are an experimental addition in Pandas 0.24. They
# allow integer columns to contain NaN values the way float columns can.
#
# This is a rather brute-force technique that loops through every column
# and every row. There has to be a more efficient way to do it, since it
# takes a long time and uses up a lot of memory.
def convert_integer(df):
    for col in df.columns:
        intcol_flag = True
        if df[col].dtype == 'float64':  # Assuming dtype is "float64"
            # TODO: Need to remove inner loop - SLOW!
            for val in df[col]:
                # If not NaN and the int() value is different from
                # the float value, then we have an actual float.
                if pd.notnull(val) and abs(val - int(val)) > 1e-6:
                    intcol_flag = False
                    break
            # If not a float, change it to an Int based on size
            if intcol_flag:
                if df[col].abs().max() < 127:
                    df[col] = df[col].astype('Int8')
                elif df[col].abs().max() < 32767:
                    df[col] = df[col].astype('Int16')
                else:  # assuming no ints greater than 2147483647
                    df[col] = df[col].astype('Int32')
        print(f"{col} is {df[col].dtype}")
    return df
I assumed that the inner for loop was the problem, but I tried replacing it with:
s = df[col].apply(lambda x: pd.notnull(x) and abs(x - int(x)) > 1e-6)
if s.any():
    intcol_flag = False
and it is still just as slow.
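For reference, the per-row check can be vectorized entirely, which is likely where the time is going: both the inner loop and apply call Python-level code once per value. The sketch below is one way to do it under my assumptions (the name convert_integer_fast is hypothetical, and it compares against round() rather than int() truncation, which differs for values like 2.9999999); it is not the original code.

```python
import numpy as np
import pandas as pd

def convert_integer_fast(df):
    """Vectorized sketch: downcast float64 columns that hold only
    integer-like values (plus NaN) to nullable Int types."""
    for col in df.columns:
        if df[col].dtype != 'float64':
            continue
        nonnull = df[col].dropna()
        # Vectorized integer-likeness test: any non-null value that
        # differs from its rounded value by more than 1e-6 is a real float.
        if (nonnull - nonnull.round()).abs().gt(1e-6).any():
            continue  # genuine floats present; leave the column alone
        # Pick the smallest nullable integer type that fits the data.
        m = nonnull.abs().max() if len(nonnull) else 0
        # Round first so the cast never sees a not-quite-integral float.
        if m < 127:
            df[col] = df[col].round().astype('Int8')
        elif m < 32767:
            df[col] = df[col].round().astype('Int16')
        else:  # assuming nothing exceeds the Int32 range
            df[col] = df[col].round().astype('Int32')
    return df
```

Computing dropna() once and reusing it for both the integer test and the max keeps everything in NumPy-level operations, so each column costs a handful of array passes instead of one Python iteration per row.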
Here is some sample data and the desired output:
np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.NaN], (3, 9)),
                  columns=[f'col{i}' for i in range(9)])
df
col0 col1 col2 col3 col4 col5 col6 col7 col8
0 2.0 NaN 111111.0 1.0 2.0 5000.0 111111.0 2.0 NaN
1 1.0 NaN 2.0 3.3 1.0 2.0 1.0 3.3 1.0
2 111111.0 5000.0 1.0 111111.0 5000.0 1.0 5000.0 3.3 2.0
And the result should be:
col0 is Int32
col1 is Int16
col2 is Int32
col3 is float64
col4 is Int16
col5 is Int16
col6 is Int32
col7 is float64
col8 is Int8