Is there a way to estimate the size a dataframe would take without loading it into memory? I already know that I do not have enough memory for the dataframe I am trying to create, but I do not know how much more memory would be required to fully create it.
- Where are you loading it from? – user3483203 Nov 15 '19 at 21:03
- How are you creating or reading the initial df? – Ajit Wadalkar Nov 15 '19 at 21:03
- Check this: https://stackoverflow.com/questions/18089667/how-to-estimate-how-much-memory-a-pandas-dataframe-will-need – fsl Nov 15 '19 at 21:05
- alws_cnfsd, you can convert dataframes to JSON, which can be stored in a database like MongoDB. Have you thought about doing something like that so you can paginate your data without having to use so much memory? – oppressionslayer Nov 15 '19 at 21:20
- I am trying to make a new padded data frame with all possible combinations of my dataset, based on other data frames that I have read in from an SQL query. I know that it will be over 22 million rows and 18 columns. – alws_cnfsd Nov 16 '19 at 20:35
- @oppressionslayer I have not worked with JSON, but I will look into it. Thank you for the suggestion. – alws_cnfsd Nov 16 '19 at 20:35
- I really recommend the DB solution when the data set gets large (see the chunked-read sketch after this thread). If you're ever looking at it seriously and have questions on how to port the data, let me know and I'll be happy to answer them; DB questions are my favorite. I upvoted your question because I think these kinds of considerations are important to have answers for. – oppressionslayer Nov 16 '19 at 23:09
- @oppressionslayer Thank you, I believe you are right. I am struggling to wrap my head around how I would use a DB approach, since I am only used to pandas data frames. I can read and write data between SQL databases and data frames just fine. If/when I have specific questions, I may reach back out to you for some ideas. – alws_cnfsd Nov 18 '19 at 22:43
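As a concrete illustration of the chunked/DB approach the comments describe, here is a minimal sketch; the connection, table name, and chunk size are placeholders, not details from the question:

import pandas as pd
import sqlite3  # stand-in; any DB-API or SQLAlchemy connection works

conn = sqlite3.connect('example.db')  # hypothetical database

# chunksize turns read_sql into an iterator of smaller DataFrames,
# so only one chunk is held in memory at a time.
total_rows = 0
for chunk in pd.read_sql('SELECT * FROM my_table', conn, chunksize=100_000):
    total_rows += len(chunk)  # replace with your real per-chunk processing

conn.close()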
2 Answers
You can calculate the memory usage for one row, and estimate based on it:

import pandas as pd

data = {'name': ['Bill'],
        'year': [2012],
        'num_sales': [4]}
df = pd.DataFrame(data, index=['sales'])
df.memory_usage(index=True).sum()  # -> 32
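To extend that to the 22-million-row, 18-column frame mentioned in the comments, a rough scale-up sketch (the column names and dtypes below are assumptions for illustration; build the one-row sample from your real data):

import pandas as pd

# Hypothetical one-row sample with the same width as the target frame.
sample = pd.DataFrame({f'col_{i}': [1.0] for i in range(18)})

# deep=True measures actual string payloads in object columns;
# index=False avoids counting fixed RangeIndex overhead as a per-row cost.
bytes_per_row = sample.memory_usage(index=False, deep=True).sum()

n_rows = 22_000_000  # row count mentioned in the comments
print(f"roughly {bytes_per_row * n_rows / 1024**3:.1f} GiB")  # ~3 GiB for 18 float64 columns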

oppressionslayer
I believe you're looking for df.memory_usage, which tells you how much memory each column occupies. Altogether it would go something like:
df.memory_usage().sum()
Output:
123123000
You can do more specific things like including the index (index=True) or passing deep=True, which will "introspect the data deeply" by measuring the actual contents of object columns. Feel free to check the documentation!
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html
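For instance, a small sketch (the values are made up) showing how deep=True changes the reported size when object columns are present:

import pandas as pd

df = pd.DataFrame({'name': ['Bill', 'Alexandra'], 'year': [2012, 2013]})

# Shallow count: each object cell is counted as an 8-byte pointer.
print(df.memory_usage(index=True).sum())

# Deep count: the actual string payloads are measured, so the total grows.
print(df.memory_usage(index=True, deep=True).sum())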

Gorlomi