
I have a dataset of about 140,000,000 records stored in a database. I need to compute basic statistics such as mean, max, min, and standard deviation on these data using Python.

But when I read it in chunks using a query like "Select * from Mytable order by ID limit %d offset %d" % (chunksize, offset), following the approach from How to create a large pandas dataframe from an sql query without running out of memory?, the execution takes more than an hour and is still running.

Since that takes too long, I have now decided to read only a few records at a time and save the statistics obtained from pandas.describe() into a CSV. Doing this for the entire data will leave me with multiple CSVs, each containing only the statistics of one chunk.
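For reference, my chunked read looks roughly like this (Mytable and the ID column are from my schema; the connection, chunk size and file names here are just placeholders):

```python
import sqlite3          # any DB-API connection works with pandas.read_sql
import pandas as pd

conn = sqlite3.connect("mydata.db")   # placeholder connection
chunksize = 1_000_000

offset, part = 0, 0
while True:
    query = "Select * from Mytable order by ID limit %d offset %d" % (chunksize, offset)
    chunk = pd.read_sql(query, conn)
    if chunk.empty:
        break
    # keep only the per-chunk statistics, not the raw rows
    chunk.describe().to_csv("stats_part_%03d.csv" % part)
    offset += chunksize
    part += 1
```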

Is there a way to merge these CSVs to get the overall statistics for the entire 140,000,000 records?

Arun

2 Answers


In this case (computing mean, max, min, SD for a huge dataset split across different files), you can compute what you need (mean, max, etc.) on the first file, keep only the result, then open the second file and compute the same quantities, combining them with the result from the first file, and so on for the remaining files.
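A minimal sketch of that idea (plain Python, with made-up chunk data): each chunk is reduced to a small summary (count, mean, sum of squared deviations, min, max), and summaries are merged pairwise with the standard mean/variance combination formula, so only the running result stays in memory.

```python
import math

def summarize(values):
    """Reduce one chunk to a small summary: n, mean, M2 (sum of squared
    deviations from the mean), min and max."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values)
    return {"n": n, "mean": mean, "M2": m2, "min": min(values), "max": max(values)}

def combine(a, b):
    """Merge two summaries into one, exactly, without revisiting the raw data."""
    n = a["n"] + b["n"]
    delta = b["mean"] - a["mean"]
    mean = a["mean"] + delta * b["n"] / n
    m2 = a["M2"] + b["M2"] + delta ** 2 * a["n"] * b["n"] / n
    return {
        "n": n,
        "mean": mean,
        "M2": m2,
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }

# made-up chunks standing in for the per-file data
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

total = summarize(chunks[0])
for chunk in chunks[1:]:
    total = combine(total, summarize(chunk))

std = math.sqrt(total["M2"] / (total["n"] - 1))   # sample standard deviation
print(total["mean"], total["min"], total["max"], std)
```

Min, max and mean merge trivially; keeping the M2 term is what lets the standard deviation be combined exactly instead of being approximated from per-chunk std values.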

Dadep

Have you tried using pickle? Save and load in pickle format, and use a pandas DataFrame to calculate the summary statistics.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_pickle.html
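A small sketch of that suggestion (the file name and data are placeholders): to_pickle/read_pickle round-trip a DataFrame, and describe() then gives the summary statistics.

```python
import pandas as pd

# placeholder chunk that would have come from the database
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

# save in pickle format and load it back later
df.to_pickle("chunk_000.pkl")
restored = pd.read_pickle("chunk_000.pkl")

# count, mean, std, min, max and quartiles in one call
print(restored.describe())
```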

And if this doesn't work, then perhaps revisit the objective of capturing such a large dataset, and break it down by category, time period, or something more meaningful.

john doe