
Creating a large pandas DataFrame (where each entry is a float and the data is on the order of 30,000 rows and tens of columns) from a dictionary can be done in a short amount of time by calling:

import pandas as pd
df = pd.DataFrame(my_dict)

This df object is created very quickly (about 0.05 seconds).

Additionally, saving and reloading the data frame using to_pickle and read_pickle can be done quickly:

df.to_pickle(save_path)                  # takes ~2.5 seconds
reloaded_df = pd.read_pickle(save_path)  # takes ~0.1 seconds
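
For reference, these timings can be reproduced with a sketch along the following lines (the random float columns are a hypothetical stand-in for my_dict, and save_path here is a hypothetical temp-file location; time.perf_counter is used for the measurements):

import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical stand-in for my_dict: 30,000 rows of floats in 20 columns
my_dict = {"col{}".format(i): np.random.rand(30000) for i in range(20)}

start = time.perf_counter()
df = pd.DataFrame(my_dict)
print("construction: {:.3f} s".format(time.perf_counter() - start))

# Hypothetical save location for the pickle file
save_path = os.path.join(tempfile.gettempdir(), "my_df.pkl")

start = time.perf_counter()
df.to_pickle(save_path)
print("to_pickle:    {:.3f} s".format(time.perf_counter() - start))

start = time.perf_counter()
reloaded_df = pd.read_pickle(save_path)
print("read_pickle:  {:.3f} s".format(time.perf_counter() - start))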

However, when I try to do any operations on reloaded_df, it takes an unreasonable amount of time and memory. For example, calling:

reloaded_df.head()  # takes many minutes to run and uses a lot of RAM

Why is reloading the data frame so quick, but operating on it so slow? Also, what would be a work-around so that calling reloaded_df.head() returns quickly after reloading the data frame?

The question How to store a dataframe using Pandas does not address my question, because it does not discuss the delay in using the pandas dataframe after reloading it from a pickle file.

I am using Python 3.5, pandas 0.22, and Windows 10.

sdrinz23
  • I've worked with similar sized dataframes and never had this problem. Does `df.iloc[:10]` take equally long? – jpp Feb 28 '18 at 17:07
  • Thank you for pointing this out. df.iloc[:10] also took a very long time, which led me to realize my dataframe was many orders of magnitude larger than I had thought. It was quick to load, which I had originally thought implied it was a reasonable size. However, thanks to Yserbius' comment, I realize lazy evaluation caused it to be created and loaded quickly; a size check like the sketch below would have caught it. – sdrinz23 Feb 28 '18 at 17:41
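
For future readers, a quick way to rule this out is to check the shape and in-memory footprint before running anything expensive. A minimal sketch (df here is a hypothetical stand-in for the reloaded frame):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the reloaded DataFrame
df = pd.DataFrame({"col{}".format(i): np.random.rand(30000) for i in range(20)})

# Cheap checks that reveal the true size before any expensive call
print(df.shape)                          # (rows, columns)
print(df.memory_usage(deep=True).sum())  # total in-memory footprint, in bytes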

1 Answer


Not certain, but it's possible that pandas is not loading all of the data into memory at once; there can also be compression involved in DataFrame IO operations. What may be happening is that pandas is doing a lazy load on the file, not actually reading the data into memory until it is accessed.
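
A minimal way to probe that hypothesis (a sketch, assuming save_path points at the pickle written earlier) is to compare the pickle's size on disk with the frame's reported in-memory footprint; if the data were truly lazy-loaded or heavily compressed, the two numbers could diverge sharply:

import os

import pandas as pd

# Hypothetical path, assumed to match the save_path used with to_pickle
save_path = "my_df.pkl"

reloaded_df = pd.read_pickle(save_path)

on_disk = os.path.getsize(save_path)                   # bytes on disk
in_memory = reloaded_df.memory_usage(deep=True).sum()  # bytes in memory
print("on disk:   {:.1f} MB".format(on_disk / 1e6))
print("in memory: {:.1f} MB".format(in_memory / 1e6))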

Yserbius
  • For anyone who views this question in the future, the problem was not with the pickling, but that my dataframe was many orders of magnitude larger than I had thought. Lazy evaluation caused it to be created and loaded quickly, which I had originally thought meant it was a reasonable size. – sdrinz23 Feb 28 '18 at 17:44
  • @sdrinz23: if this whole thing is useless because the problem was misstated, please just delete the question. – tom10 Sep 25 '20 at 00:58
  • I agree, but Stack Overflow won't let me delete it. – sdrinz23 Sep 26 '20 at 16:59