
I have a dataset of about 140,000,000 records stored in a database. I need to compute basic statistics such as mean, max, min, and standard deviation on these data using Python.

But when I read it in chunks using a query like "Select * from Mytable order by ID limit %d offset %d" % (chunksize, offset), following the approach from How to create a large pandas dataframe from an sql query without running out of memory?, the execution takes more than an hour and is still running.

Since that takes too long, I have now decided to read only a few records at a time and save the statistics obtained from pandas.describe() into a CSV. Doing this for the entire data will leave me with multiple CSVs, each containing only the statistics of one chunk.
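For reference, my chunked read looks roughly like this (Mytable and the ID column are from my schema; the connection, chunk size and file names here are just placeholders):

```python
import sqlite3          # any DB-API connection works with pandas.read_sql
import pandas as pd

conn = sqlite3.connect("mydata.db")   # placeholder connection
chunksize = 1_000_000

offset, part = 0, 0
while True:
    query = "Select * from Mytable order by ID limit %d offset %d" % (chunksize, offset)
    chunk = pd.read_sql(query, conn)
    if chunk.empty:
        break
    # keep only the per-chunk statistics, not the raw rows
    chunk.describe().to_csv("stats_part_%03d.csv" % part)
    offset += chunksize
    part += 1
```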

Is there a way to merge these CSVs to get the overall statistics for the entire 140,000,000 records?

Arun

2 Answers


In this case (computing mean, max, min, SD for a huge dataset split across different files), you can compute what you need (mean, max, etc.) on the first file, keep only the result, then open the second file and compute the same quantities, combining them with the result from the first file, and so on for the remaining files.
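A minimal sketch of that idea (plain Python, with made-up chunk data): each chunk is reduced to a small summary (count, mean, sum of squared deviations, min, max), and summaries are merged pairwise with the standard mean/variance combination formula, so only the running result stays in memory.

```python
import math

def summarize(values):
    """Reduce one chunk to a small summary: n, mean, M2 (sum of squared
    deviations from the mean), min and max."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return {"n": n, "mean": mean, "M2": m2, "min": min(values), "max": max(values)}

def combine(a, b):
    """Merge two summaries into one, exactly, without revisiting the raw data."""
    n = a["n"] + b["n"]
    delta = b["mean"] - a["mean"]
    mean = a["mean"] + delta * b["n"] / n
    m2 = a["M2"] + b["M2"] + delta ** 2 * a["n"] * b["n"] / n
    return {
        "n": n,
        "mean": mean,
        "M2": m2,
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }

# made-up chunks standing in for the per-file data
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

total = summarize(chunks[0])
for chunk in chunks[1:]:
    total = combine(total, summarize(chunk))

std = math.sqrt(total["M2"] / (total["n"] - 1))   # sample standard deviation
print(total["mean"], total["min"], total["max"], std)
```

Min, max and mean merge trivially; keeping the M2 term is what lets the standard deviation be combined exactly instead of being approximated from per-chunk std values.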

Dadep

Have you tried using pickle? Save and load in pickle format, and use a pandas DataFrame to calculate the summary statistics.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html
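A small sketch of that suggestion (the file name and data are placeholders): to_pickle/read_pickle round-trip a DataFrame, and describe() then gives the summary statistics.

```python
import pandas as pd

# placeholder chunk that would have come from the database
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

# save in pickle format and load it back later
df.to_pickle("chunk_000.pkl")
restored = pd.read_pickle("chunk_000.pkl")

# count, mean, std, min, max and quartiles in one call
print(restored.describe())
```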

And if this doesn't work, then perhaps revisit the objective of capturing such a large dataset, and break it down by category, time period, or something more meaningful.

john doe