I use pandas for my day-to-day work, and some of the data frames I work with are overwhelmingly big (on the order of hundreds of millions of rows by hundreds of columns). Is there any way of reducing the RAM consumption?
3 Answers
You can use this function. It reduces the size of the data by downcasting each column to the smallest dtype that can hold its values.
The code is not mine; I copied it from the following link and adapted it to my needs: https://www.mikulskibartosz.name/how-to-reduce-memory-usage-in-pandas/
import gc

import numpy as np
from tqdm import tqdm


def check_if_integer(column):
    # Helper referenced below but missing from the original answer; this is an
    # assumed implementation: a float column can be treated as integer-like when
    # it has no missing values and every value is a whole number.
    if column.isnull().any():
        return False
    return (column % 1 == 0).all()


def reduce_mem_usage(df, int_cast=True, obj_to_category=False, subset=None):
    """
    Iterate through all the columns of a dataframe and modify the data type to reduce memory usage.
    :param df: dataframe to reduce (pd.DataFrame)
    :param int_cast: indicate if columns should be tried to be casted to int (bool)
    :param obj_to_category: convert non-datetime related objects to category dtype (bool)
    :param subset: subset of columns to analyse (list)
    :return: dataset with the column dtypes adjusted (pd.DataFrame)
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2
    gc.collect()
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    cols = subset if subset is not None else df.columns.tolist()

    for col in tqdm(cols):
        col_type = df[col].dtype

        if col_type != object and col_type.name != 'category' and 'datetime' not in col_type.name:
            c_min = df[col].min()
            c_max = df[col].max()

            # test if column can be converted to an integer
            treat_as_int = str(col_type)[:3] == 'int'
            if int_cast and not treat_as_int:
                treat_as_int = check_if_integer(df[col])

            if treat_as_int:
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.uint8).min and c_max < np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        elif 'datetime' not in col_type.name and obj_to_category:
            df[col] = df[col].astype('category')

    gc.collect()
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.3f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
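For illustration, a minimal usage sketch (the toy dataframe and its column names are made up; any dataframe works):

import numpy as np
import pandas as pd

# Toy data: a column of small integers and a float column (hypothetical names).
df = pd.DataFrame({
    'small_ints': np.random.randint(0, 100, size=1_000_000),
    'measurements': np.random.rand(1_000_000) * 1000,
})
df = reduce_mem_usage(df, int_cast=True, obj_to_category=False)
# The integer column should now be int8 and the float column float16
# (see the precision caveats raised in the comments below).
print(df.dtypes)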

- This code is missing the imports (gc, tqdm) and the definition of "check_if_integer", maybe even more. – Vega Aug 20 '20 at 14:29
- This is also inherently flawed: it completely ignores the float precision, which may be (and likely is) required. – Jayme Gordon Aug 15 '21 at 02:02
- It messed up the precision. The 75% memory reduction seemed nice... until I noticed that all of the numbers (related to finance) are either approximated in the wrong way or bluntly have a different value from the source (precision lost after some decimals). This should probably be addressed in the "accepted" answer. – Alex Nov 05 '22 at 19:48
- Can you please give some examples and a reproducible sample to debug and understand your concern? – ivallesp Nov 07 '22 at 06:17
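To make the precision concern in the comments concrete, here is a minimal sketch (the values are made up for illustration):

import numpy as np
import pandas as pd

s = pd.Series([1000.01, 123.456])      # e.g. monetary amounts stored as float64
print(s.astype(np.float16).tolist())   # [1000.0, 123.4375]
# float16 keeps only about 3 significant decimal digits, so the cent in
# 1000.01 disappears and 123.456 becomes 123.4375.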
Consider using Dask DataFrames if your data does not fit in memory. It has nice features like delayed computation and parallelism, which allow you to keep data on disk and pull it in a chunked way only when results are needed. It also has a pandas-like interface, so you can mostly keep your current code.
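For example, a minimal sketch, assuming the data lives in CSV files on disk and using made-up column names (user_id, amount):

import dask.dataframe as dd

# Nothing is read into RAM here; Dask only records the work to be done.
ddf = dd.read_csv('data/*.csv')

# Same API you would use with pandas; still lazy at this point.
per_user = ddf.groupby('user_id')['amount'].sum()

# Only compute() streams the files through memory in chunks and
# returns an ordinary in-memory pandas object with the result.
totals = per_user.compute()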

- Does it work well in general? I tried it like 2 years ago and it was very buggy. – ivallesp Aug 16 '19 at 21:52
- It works pretty well for me, but I often follow the "Reduce, and then use Pandas" suggestion here: https://docs.dask.org/en/latest/dataframe-best-practices.html, so for the more intricate parts of the analysis I may default to pandas itself. – foglerit Aug 16 '19 at 21:59
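A sketch of that "reduce, and then use pandas" pattern (the file path and column names are made up):

import dask.dataframe as dd

ddf = dd.read_csv('data/*.csv')
# Reduce with Dask first: filter rows and keep only the columns you need...
recent = ddf[ddf['year'] == 2023][['user_id', 'amount']]
# ...then switch to plain pandas for the intricate part of the analysis.
pdf = recent.compute()
print(pdf.describe())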
If you are working with a dataframe of numeric values, you could consider using the downcast option of pd.to_numeric, applied column by column with apply. It's not as efficient as the accepted solution (only about a 50% reduction), but it is simpler and faster. I do not have precision-loss problems because I'm converting float64 to float32, not float16.
Here is my original dataframe's memory usage:
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Columns: 1028 entries, 0 to 1027
dtypes: float64(1012), int64(16)
memory usage: 5.1 MB
Then I use the apply function:
df = df.apply(pd.to_numeric, downcast='float')
df = df.apply(pd.to_numeric, downcast='integer')
Here is the modified dataframe's memory usage:
df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Columns: 1028 entries, 0 to 1027
dtypes: float32(1012), int8(16)
memory usage: 2.5 MB
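If you want to double-check that nothing meaningful was lost, you can compare against a copy kept before downcasting; a minimal sketch (df_original is just an illustrative name):

import numpy as np
import pandas as pd

df_original = df.copy()   # temporary copy, costs extra memory once
df = df.apply(pd.to_numeric, downcast='float')
df = df.apply(pd.to_numeric, downcast='integer')

# Values should still agree to within float32 precision.
assert np.allclose(df_original.to_numpy(dtype=np.float64),
                   df.to_numpy(dtype=np.float64))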
