I am developing a program that processes big data.
While researching how to reduce the memory usage of DataFrames, I found something strange.
I tried to load a single column from a big sas7bdat file and measured the memory usage of two different approaches.
Load by Series (growing with a pd.concat per chunk)
0 b'Richard Wilkins'
1 b'Richard Wilkins'
2 b'Richard Wilkins'
3 b'Richard Wilkins'
4 b'Richard Wilkins'
...
16539995 b'Rebecca Davis'
16539996 b'Rebecca Davis'
16539997 b'Rebecca Davis'
16539998 b'Rebecca Davis'
16539999 b'Rebecca Davis'
Length: 16540000, dtype: object
Result Series Data Memory usage is 264640000 Bytes -> With pd.Series.memory_usage()
Result Series Frame Memory usage is 1024200016 Bytes -> With sys.getsizeof(Series)
Total Usage is 2740.828125 MB and time is 45.28001594543457 seconds -> Physical memory with psutil.Process().memory_info().rss
Load by DataFrame (chunks collected in a list, one pd.concat; see https://github.com/pandas-dev/pandas/issues/37088#issuecomment-708321128)
0 b'Richard Wilkins'
1 b'Richard Wilkins'
2 b'Richard Wilkins'
3 b'Richard Wilkins'
4 b'Richard Wilkins'
...
16539995 b'Rebecca Davis'
16539996 b'Rebecca Davis'
16539997 b'Rebecca Davis'
16539998 b'Rebecca Davis'
16539999 b'Rebecca Davis'
Name: name, Length: 16540000, dtype: object
Result DF Data Memory usage is 132320128 Bytes
Result DF Frame Memory usage is 891880144 Bytes
Total Usage is 3775.93359375 MB and time is 41.85226058959961 seconds
According to these results, the Series data memory usage is about 252 MB while the DataFrame data memory usage is about 126 MB, exactly double. Likewise, the Series object memory usage is about 976 MB versus about 850 MB for the DataFrame, so the DataFrame is lower there too.
However, the physical memory readings say the opposite: the DataFrame approach uses more memory. How can this be, and which memory measurement should I trust during development?
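To understand what each counter actually measures, here is a minimal sketch on a synthetic object Series (my own illustration, not the SAS data; values and sizes are made up). As far as I can tell, the shallow memory_usage() only counts the 8-byte object pointers plus the index, deep=True also walks the Python objects behind those pointers, getsizeof() on a pandas object reports roughly the deep figure, and RSS covers the entire process:

# memory_counters.py - synthetic sketch, not the SAS data
from sys import getsizeof

import pandas as pd
import psutil

s = pd.Series([b"Richard Wilkins"] * 1_000_000, dtype=object)

# Shallow count: 8 bytes per object pointer plus the index;
# the bytestrings themselves are not included.
print(f"memory_usage()          : { s.memory_usage() } Bytes")

# Deep count: additionally sums the sizes of the Python objects
# that the pointers refer to.
print(f"memory_usage(deep=True) : { s.memory_usage(deep=True) } Bytes")

# getsizeof on a pandas object appears to report roughly the deep
# figure rather than just the container shell.
print(f"sys.getsizeof(s)        : { getsizeof(s) } Bytes")

# RSS is the whole process: interpreter, imported modules, and
# allocator pages freed but not yet returned to the OS.
print(f"Process RSS             : { psutil.Process().memory_info().rss } Bytes")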
I tried this with:
# sas_read_column_series.py
import os
import time
from sys import getsizeof

import pandas as pd
import psutil

file_path = os.path.dirname(os.path.realpath(__file__)) + os.sep
df_iter = pd.read_sas(
    file_path + "test.sas7bdat", chunksize=1000000)
# Columns are idx, name, age, gender, country, birthdate, email
ps = psutil.Process()
start_time = time.time()
result_series = pd.Series(dtype=object)  # explicit dtype avoids an empty-Series warning
while True:
    df_block = df_iter.read()
    print(
        f"Now Usage is { ps.memory_info().rss / 1024 / 1024 } MB and time is { time.time() - start_time } seconds")
    if len(df_block) == 0:
        break
    # Re-concatenating every iteration copies everything accumulated so far
    result_series = pd.concat([result_series, df_block["name"]])
end_time = time.time()
print(f"Result Series is { result_series }")
print(
    f"Result Series Data Memory usage is { result_series.memory_usage() } Bytes")
print(
    f"Result Series Frame Memory usage is { getsizeof(result_series) } Bytes")
print(
    f"Total Usage is { ps.memory_info().rss / 1024 / 1024 } MB and time is { end_time - start_time } seconds")
# sas_read_column_dataframe.py
import os
import time
from sys import getsizeof

import pandas as pd
import psutil

file_path = os.path.dirname(os.path.realpath(__file__)) + os.sep
df_iter = pd.read_sas(
    file_path + "test.sas7bdat", chunksize=1000000)
# Columns are idx, name, age, gender, country, birthdate, email
ps = psutil.Process()
start_time = time.time()
result_df = []
while True:
    df_block = df_iter.read()
    print(
        f"Now Usage is { ps.memory_info().rss / 1024 / 1024 } MB and time is { time.time() - start_time } seconds")
    if len(df_block) == 0:
        break
    # Collect the chunks in a list and concatenate once after the loop
    result_df.append(df_block["name"])
result_df = pd.concat(result_df, axis=0)
end_time = time.time()
print(f"Result DF is { result_df }")
print(
    f"Result DF Data Memory usage is { result_df.memory_usage() } Bytes")
print(
    f"Result DF Frame Memory usage is { getsizeof(result_df) } Bytes")
print(
    f"Total Usage is { ps.memory_info().rss / 1024 / 1024 } MB and time is { end_time - start_time } seconds")