Is there a way to estimate the size a dataframe would take without loading it into memory? I already know that I do not have enough memory for the dataframe I am trying to create, but I do not know how much more memory would be required to fully create it.
- Where are you loading it from? – user3483203 Nov 15 '19 at 21:03
- How are you creating or reading the initial df? – Ajit Wadalkar Nov 15 '19 at 21:03
- Check this: https://stackoverflow.com/questions/18089667/how-to-estimate-how-much-memory-a-pandas-dataframe-will-need – fsl Nov 15 '19 at 21:05
- alws_cnfsd, you can convert dataframes to JSON, which can be stored in a database like MongoDB. Have you thought about doing something like that so you can paginate your data without having to use so much memory? – oppressionslayer Nov 15 '19 at 21:20
- I am trying to make a new padded data frame with all possible combinations of my dataset, based on other data frames that I have read in from an SQL query. I know that it will be over 22 million rows and 18 columns. – alws_cnfsd Nov 16 '19 at 20:35
- @oppressionslayer I have not worked with JSON, but I will look into it. Thank you for the suggestion. – alws_cnfsd Nov 16 '19 at 20:35
- I really recommend the DB solution when the data set gets large (see the chunked-read sketch after this thread). If you're ever looking at it seriously and have questions on how to port the data, let me know and I'll be happy to answer them; DB questions are my favorite. I upvoted your question because I think these kinds of considerations are important to have answers for. – oppressionslayer Nov 16 '19 at 23:09
- @oppressionslayer Thank you, I believe you are right. I am struggling to wrap my head around how I would use a DB approach, since I am only used to pandas data frames. I can read and write data between SQL databases and data frames just fine. If/when I have specific questions, I may reach back out to you for some ideas. – alws_cnfsd Nov 18 '19 at 22:43
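As a concrete illustration of the chunked/DB approach the comments describe, here is a minimal sketch; the connection, table name, and chunk size are placeholders, not details from the question:

import pandas as pd
import sqlite3  # stand-in; any DB-API or SQLAlchemy connection works

conn = sqlite3.connect('example.db')  # hypothetical database

# chunksize turns read_sql into an iterator of smaller DataFrames,
# so only one chunk is held in memory at a time.
total_rows = 0
for chunk in pd.read_sql('SELECT * FROM my_table', conn, chunksize=100_000):
    total_rows += len(chunk)  # replace with your real per-chunk processing

conn.close()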
2 Answers
You can calculate the memory usage for one row, and estimate based on it:

import pandas as pd

data = {'name': ['Bill'],
        'year': [2012],
        'num_sales': [4]}
df = pd.DataFrame(data, index=['sales'])
df.memory_usage(index=True).sum()  # -> 32
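To extend that to the 22-million-row, 18-column frame mentioned in the comments, a rough scale-up sketch (the column names and dtypes below are assumptions for illustration; build the one-row sample from your real data):

import pandas as pd

# Hypothetical one-row sample with the same width as the target frame.
sample = pd.DataFrame({f'col_{i}': [1.0] for i in range(18)})

# deep=True measures actual string payloads in object columns;
# index=False avoids counting fixed RangeIndex overhead as a per-row cost.
bytes_per_row = sample.memory_usage(index=False, deep=True).sum()

n_rows = 22_000_000  # row count mentioned in the comments
print(f"roughly {bytes_per_row * n_rows / 1024**3:.1f} GiB")  # ~3 GiB for 18 float64 columns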

oppressionslayer
I believe you're looking for df.memory_usage, which tells you how much memory each column occupies. Altogether it would go something like:
df.memory_usage().sum()
Output:
123123000
You can do more specific things like including the index (index=True) or passing deep=True, which will "introspect the data deeply" by measuring the actual contents of object columns. Feel free to check the documentation!
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html
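For instance, a small sketch (the values are made up) showing how deep=True changes the reported size when object columns are present:

import pandas as pd

df = pd.DataFrame({'name': ['Bill', 'Alexandra'], 'year': [2012, 2013]})

# Shallow count: each object cell is counted as an 8-byte pointer.
print(df.memory_usage(index=True).sum())

# Deep count: the actual string payloads are measured, so the total grows.
print(df.memory_usage(index=True, deep=True).sum())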

Gorlomi