
I need to process a fairly large DataFrame (~80M records), and memory consumption is an issue. So columns with numerical data are shrunk to the smallest possible dtype, like np.int8, np.int16, np.int32. At some stage I need to compute a new column, using some math on existing columns, and it needs int64 capacity. That is where most pandas arithmetic constructions fail. It took me a while to track down that the reason was integer overflow: in simple words, calculations like

   newCol = col16 * col16
   newCol = col32 * value16

often produce incorrect results, even though newCol is created as int64. Here is a simple explicit example: calculate newCol = A * 100000, which for any A = aaaaa should obviously compute to a value like aaaaa00000.

However, see below:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1<<7, 1<<15, size=5), columns=['A'], dtype=np.int16)
df.eval('Q = A * 100000', inplace=True)   # 1st naive approach, off the top of my head
df['W'] = df['A'] * 100000

# trying to use a constant c = np.int64(...) to force the expression evaluator to use int64
c = np.int64(10000)
df.eval('R = @c * A', inplace=True)

# trying to create the new int64 column first and use it in the calculation:
df['T'] = 0   # this creates a new column 'T' with dtype=int64, filled with 0
df.eval('T = 100000 * A', inplace=True)

df['S'] = 0
# trying to force int64 via the new column 'S', which is int64
df['S'] = df['S'] + df['A'] * 100000

# here, finally, an approach that works; the calculation uses int64:
df['X'] = 1
df.eval('X = X * 100000 * A', inplace=True)

# just output formatting
pd.set_option('display.max_columns', None)
pd.options.display.width = 222
df.index = [''] * len(df)

print(df)
df.info()

       A           Q           W           R           T           S           X
   3396   339600000   339600000   339600000   339600000   339600000   339600000
  26508 -1644167296 -1644167296 -1644167296 -1644167296 -1644167296  2650800000
  27942 -1500767296 -1500767296 -1500767296 -1500767296 -1500767296  2794200000
   3441   344100000   344100000   344100000   344100000   344100000   344100000
  27880 -1506967296 -1506967296 -1506967296 -1506967296 -1506967296  2788000000
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries,  to 
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   Q       5 non-null      int32
 2   W       5 non-null      int32
 3   R       5 non-null      int32
 4   T       5 non-null      int64
 5   S       5 non-null      int64
 6   X       5 non-null      int64
dtypes: int16(1), int32(3), int64(3)
memory usage: 230.0+ bytes

Of 6 different ways to do this trivial math, only column 'X' produced what I (and I'd guess most users) expected.

The obvious way to solve this would be to first convert the source column's dtype to int64 ('on the fly'), like

df['x'] = df['A'].astype(np.int64) * 100000

but I don't see it as a good solution in my case, as the data is already too big to create an even bigger temporary copy, and performance will also degrade with the redundant conversions.
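To put rough numbers on that (a back-of-the-envelope estimate, assuming the ~80M rows mentioned above, not a measurement):

import numpy as np

n = 80_000_000
tmp_mb = n * np.dtype(np.int64).itemsize / 1e6   # int64 temporary made by astype(np.int64): ~640 MB
res_mb = n * np.dtype(np.int64).itemsize / 1e6   # plus the int64 result column itself: ~640 MB
print(tmp_mb, res_mb)                            # 640.0 640.0

So each astype() conversion of a source column costs hundreds of megabytes of transient memory on top of the result.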

So my question: how can I calculate this on the fly, without creating an int64 copy of the whole source columns (I don't have the RAM for that), i.e. directly and correctly compute newCol64 = srcCol8 * srcCol16 * srcCol16 on a 64-bit CPU?

Is there an explicit type-conversion syntax available for pandas.eval() that could be applied on the fly while computing the result?

Fedor
  • As a general rule, `numpy`, and `pandas` by extension, does not do 'on-the-fly' conversions. It takes a building-block approach, converting whole arrays, with copies as needed, and then doing the multiply. Operations like multiply are done in compiled code, with a limited number of type mixes (doubles, floats, etc.). Depending on the compilation it might not even have compiled options involving shorts. It's difficult to micromanage memory use when doing a sequence of calculations. – hpaulj Jun 15 '23 at 17:19
  • @hpaulj How then does pandas.eval() make calculations with values of different dtypes (e.g. x=A*B*C, where A, B, C are int8, int16, int32...)? Does it first create normalized copies of the source data with the same dtype (I doubt that)? Then why can't it estimate that int16*int16*int16 needs to be calculated as int64, not int16... or at least report a warning or overflow error? Currently it silently calculates nonsense in this case. – Fedor Jun 23 '23 at 16:05

1 Answer


If you multiply an array by a scalar value, the selected dtype will be that of the array. However, if you do a math operation between two arrays (like X * A), the result will be "upcast":

>>> df['A'] * c
0   -31392
1    14992
2   -14064
3     6944
4   -13904
Name: A, dtype: int16  # bad result, bad dtype

>>> df['A'] * [c]
0    196380000
1    288570000
2    317770000
3    168500000
4     76270000
Name: A, dtype: int64  # everything is ok

You can read the documentation about output type determination.
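A minimal NumPy sketch of the same distinction (assuming the legacy value-based casting of NumPy < 2.0, which is what this thread reflects; NumPy 2.0's NEP 50 rules treat NumPy scalars differently):

import numpy as np

a = np.arange(5, dtype=np.int16)

# Scalar operand: the scalar's *value* picks its effective dtype.
# 10000 fits in int16, so the result stays int16 and can overflow silently.
print((a * np.int64(10000)).dtype)                    # int16

# 1-element array operand: normal dtype promotion applies instead,
# and int16 combined with int64 gives int64.
print((a * np.array([10000], dtype=np.int64)).dtype)  # int64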

In your case, you can do:

c = np.array([10000])
df.eval('R = @c * A', inplace=True)

Output:

>>> df
       A          R
0  12399  123990000
1   6026   60260000
2  17133  171330000
3  30974  309740000
4  28216  282160000

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   R       5 non-null      int64
dtypes: int16(1), int64(1)
memory usage: 178.0 bytes

EDIT

For some reason, on my machine df['A'] * [100000] is calculated as int32, despite Python being 64-bit (and so the result is wrong). Using c=np.array([100000], dtype=np.int64) instead of [100000] works though.

Even if your system is 64-bit, it seems the default int type is 32-bit in your case (and I don't know why; this is typically the case on Windows). To check the default dtype, you can use:

# My system
>>> np.int_
numpy.int64

# Your system
>>> np.int_
numpy.int32

In your case, if you do df['A'] * [10000], [10000] will be converted to np.array([10000], dtype=np.int32); that's why the result will be wrong. You have to be explicit by creating the array with np.array([10000], dtype=np.int64).
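Putting this together for the question's original goal (newCol64 = srcCol8 * srcCol16 * srcCol16), a sketch with made-up column names a, b, c: a 1-element int64 array forces the whole expression into int64 without first copying any source column:

import numpy as np
import pandas as pd

# Toy frame standing in for the question's srcCol8 / srcCol16 columns.
df = pd.DataFrame({
    'a': np.random.randint(1, 100, size=5).astype(np.int8),
    'b': np.random.randint(1, 30000, size=5).astype(np.int16),
    'c': np.random.randint(1, 30000, size=5).astype(np.int16),
})

# The 1-element int64 array promotes the whole expression to int64,
# so the intermediate products are computed in 64 bits as well.
one = np.array([1], dtype=np.int64)
df.eval('new = @one * a * b * c', inplace=True)

print(df.dtypes)   # a: int8, b: int16, c: int16, new: int64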

Corralien
  • Try it with 100000 then, please. I've got the same overflow issue, because the dtype of c is int32 (at least in my env); 10k was just not enough for an int32 overflow. – Fedor Jun 23 '23 at 12:07
  • P.S. c=np.array([100000], dtype=np.int64) is working though, thanks! I intuitively tried this approach with c=np.int64(10000), but it didn't succeed. – Fedor Jun 23 '23 at 12:13
  • @Fedor. `c = np.int64(10000)` is still a scalar value even if its type is `int64`, so the result dtype will be `np.int32` if your array dtype is `np.int32`. You have to multiply A by an array to avoid the overflow issue. – Corralien Jun 24 '23 at 04:23
  • OK, it's clear with scalars (though intuitively unexpected and IMHO wrong). What about df['S'] = df['S'] + df['A'] * 100000? Here S is int64 and the resulting dtype is int64, but the values are still wrong, since it seems df['A'] * 100000 was calculated separately in 32-bit mode. I understand the reason now, but it was still very unexpected to meet such an issue with basic arithmetic in a modern package. Some explicit typecast syntax in expressions for eval() would be very helpful as a "quick fix", but it would be better if the parser did it implicitly. – Fedor Jun 26 '23 at 06:38
  • *the resulting dtype is int64, but the values are still wrong*. Of course, yes: multiplication has priority over addition, so `df['A'] * 100000` was calculated first, and its result is already wrong. So in this case, use `df['S'] + df['A'] * [100000]`. – Corralien Jun 26 '23 at 06:50
  • For some reason, on my machine df['A'] * [100000] is calculated as int32, despite Python being 64-bit (and so the result is wrong). Using c=np.array([100000], dtype=np.int64) instead of [100000] works though. – Fedor Jun 27 '23 at 12:59
  • What is the output of `print(np.int_)` on your machine, please? – Corralien Jun 27 '23 at 13:20
  • That's why: if you don't explicitly set the dtype, the default is int32 on your system. For my system, the default dtype is int64, so `[10000]` is converted to `np.array([10000], dtype=np.int64)`, while it is converted to `np.array([10000], dtype=np.int32)` for you. – Corralien Jul 01 '23 at 04:49
  • Thanks, found a good topic where the default np.int type is discussed: https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine – Fedor Jul 03 '23 at 15:31