2
print(ua_df)
# show
ID      classification  years   Gender
347     member          070     female
597     member          050     male

s2 = sys.getsizeof(ua_df)
print(s2)

# 6974117328  (about 6.5 GB)

# Original file size: 842.1 MB
# The in-memory size is far larger than the file on disk


print(uad_dff)
# show
ID  shopCD  distance
727     27      40.22
942     27      30.76

Under the same conditions:
s3 = sys.getsizeof(uad_dff)
print(s3)

# 12483776  (about 11.9 MB)

# Original file size: 11.9 MB
# Roughly equal to the in-memory size

Why is the in-memory size of the first DataFrame so much larger than the original file, while there is no difference in the second example? Can anyone tell me why? Thank you very much!

lazy
    "The memory of the 'read_csv' data is different from that of the original data" Why do you *expect* them to be the same? Why do you think they are being represented in the same exact way? If anything, the *surprising* claim you are making is that they are the same. – juanpa.arrivillaga Sep 18 '21 at 07:05
  • Even they are different, the memory difference should not be so large. – lazy Sep 18 '21 at 07:18
    CSV/text files and `pandas` in memory represent data in very different ways. Often a string representation is more memory efficient but not suitable for computing. Sometimes pandas can save memory compared to a CSV file. A factor of 6x more memory in pandas is quite common. Without a sample of your data and the configuration of `read_csv` it's hard to tell. – Michael Szczesny Sep 18 '21 at 07:24
  • @MichaelSzczesny puts it well. If you want to reduce the memory footprint, consider: 1) casting columns to correct data type 2) pulling only data you need (row and column options in read_csv) and 3) something like `dask` allows lazy loading from disk with a pandas-like API – anon01 Sep 18 '21 at 07:27
    "Even they are different, the memory difference should not be so large" *Why do you believe that*? The differences look believable to me. – juanpa.arrivillaga Sep 18 '21 at 07:28
  • Thank you for your prompt. I want to know why. – lazy Sep 18 '21 at 07:34
  • You're storing short integers and repeated labels as strings inside `pandas`. Compare the memory usage with `pd.read_csv('large.csv', dtype={'classification':'category', 'age':'int16', 'Gender':'category'})`. – Michael Szczesny Sep 18 '21 at 07:52

1 Answer

4

Consider item = ua_df.at[0, 'years'] (.at for label-based access; .iat takes integer positions). item is a str object of length 3. In a CSV file on disk, which is just text, it takes 3 bytes to represent. In memory, since pandas stores this column with object dtype, each cell holds a pointer to a str object, and the pointer alone takes a machine word: 8 bytes on a 64-bit architecture. The Python str object itself, on my machine, takes sys.getsizeof(item) == 54 bytes. So in memory you need roughly 62 bytes to represent the same data that was stored as 3 bytes of text.
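Those numbers can be checked directly; the exact sizes vary by Python version and platform, so treat the outputs as approximate:

```python
import sys

item = "070"                 # the 3-character cell, as held in an object column
str_size = sys.getsizeof(item)
print(str_size)              # ~50 bytes on 64-bit CPython (exact value varies)

# Add the 8-byte pointer the object-dtype array stores per cell:
print(str_size + 8)          # ~60 bytes in memory vs 3 bytes of text on disk
```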

The sort of size discrepancy you are seeing here is not something unexpected.

Now consider numeric types. Pandas will typically use np.int64 or np.float64, each of which takes 8 bytes per value. But what if all your numbers have only 2-3 digits? They take only 2-3 bytes to represent as text on disk. So the comparison depends on the average number of decimal digits needed to store the values as text; that could be more or less than the uniform 8 bytes per numeric value.
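A minimal illustration of that trade-off, using stand-in values from the sample data (int64 is forced explicitly so the per-value sizes are the same on every platform):

```python
import pandas as pd

s = pd.Series([70, 50], dtype="int64")
print(s.dtype.itemsize)                    # 8 bytes per value in memory
print(len("70"))                           # 2 bytes per value as CSV text

# Downcasting closes most of the gap when the value range allows it:
print(s.astype("int16").dtype.itemsize)    # 2 bytes per value
```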

juanpa.arrivillaga
  • Thank you for your explanation; it helped me understand the underlying logic. – lazy Sep 18 '21 at 08:09
  • I suspect that the sys.getsizeof calculation logic is not suitable for data frames. I found pandas's own functions, which can check the memory usage. – lazy Sep 18 '21 at 08:13
  • @lazy yes, you should be using `df.memory_usage(deep=True)` But pandas may implement `__sizeof__` to just do that. Not sure – juanpa.arrivillaga Sep 18 '21 at 08:13
  • ua_df.info() and ua_df.memory_usage() – lazy Sep 18 '21 at 08:13
  • @lazy in any case, `sys.getsizeof` generally *underestimates* what is required – juanpa.arrivillaga Sep 18 '21 at 08:14
  • yeah. Here's an explanation. It doesn't necessarily apply to third-party extensions. It's designed for built-in objects. https://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python – lazy Sep 18 '21 at 08:34
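The measurement approach discussed in these comments, as a minimal sketch (a toy frame, so the absolute numbers are illustrative only):

```python
import sys
import pandas as pd

df = pd.DataFrame({"ID": [727, 942], "Gender": ["female", "male"]})

# Shallow accounting: object columns are counted as 8-byte pointers only.
shallow = df.memory_usage().sum()

# Deep accounting: follows each pointer and adds the str object's own size.
deep = df.memory_usage(deep=True).sum()

print(shallow, deep)        # deep > shallow whenever object columns are present
print(sys.getsizeof(df))    # pandas defines __sizeof__, so this is deep-based too
df.info(memory_usage="deep")  # per-column breakdown, printed to stdout
```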