I have a dataset of about 140,000,000 records stored in a database. I need to compute basic statistics such as mean, max, min, and standard deviation on this data using Python.
But when I try to read the data in chunks with a query like `"SELECT * FROM Mytable ORDER BY ID LIMIT %d OFFSET %d" % (chunksize, offset)`, the execution takes more than an hour and is still running. (I was following the approach from "How to create a large pandas dataframe from an sql query without running out of memory?")
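For reference, my chunked read looks roughly like this (a minimal sketch; the sqlite3 connection, database file, and chunk size are placeholders for my actual setup):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("mydata.db")  # placeholder connection
chunksize = 100_000
offset = 0

while True:
    # Page through the table with LIMIT/OFFSET -- this is the slow part.
    query = "SELECT * FROM Mytable ORDER BY ID LIMIT %d OFFSET %d" % (chunksize, offset)
    chunk = pd.read_sql_query(query, conn)
    if chunk.empty:
        break
    # ... compute per-chunk statistics here (see below) ...
    offset += chunksize
```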
Since that takes too long, I have now decided to read only a few records at a time and save the statistics obtained from `pandas.DataFrame.describe()` into a CSV. Repeating this over the entire dataset, I will end up with multiple CSVs, each containing only the statistics for one chunk.
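Concretely, inside the loop above I would do something like this per chunk (the `chunk_number` counter, initialized to 0 before the loop, and the file naming are just illustrative):

```python
# describe() gives count, mean, std, min, 25%/50%/75%, and max
# for every numeric column in the chunk.
stats = chunk.describe()
stats.to_csv("stats_chunk_%06d.csv" % chunk_number)
chunk_number += 1
```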
Is there a way to merge these CSVs to get the overall statistics for all 140,000,000 records?
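My understanding so far, sketched below: the counts sum, min/max combine directly, and the mean can be count-weighted. For the standard deviation I believe each chunk's sum of squares can be reconstructed from its count, mean, and sample std and then pooled, but I'd like confirmation that this is correct. The column name `myvalue` is just a placeholder, and I realize the quantiles (25%/50%/75%) presumably cannot be merged exactly this way.

```python
import glob

import numpy as np
import pandas as pd

# Load the per-chunk describe() outputs (index: count, mean, std, min, ..., max).
frames = [pd.read_csv(f, index_col=0) for f in glob.glob("stats_chunk_*.csv")]

col = "myvalue"  # placeholder column name
counts = np.array([f.loc["count", col] for f in frames])
means = np.array([f.loc["mean", col] for f in frames])
stds = np.array([f.loc["std", col] for f in frames])

n = counts.sum()
overall_min = min(f.loc["min", col] for f in frames)
overall_max = max(f.loc["max", col] for f in frames)
overall_mean = (counts * means).sum() / n

# Each chunk's sum of squares: sum(x^2) = (n_i - 1) * s_i^2 + n_i * m_i^2,
# since describe() reports the sample std (ddof=1). Pool them, then recover
# the overall sample variance from the pooled sum of squares.
sum_sq = ((counts - 1) * stds**2 + counts * means**2).sum()
overall_std = np.sqrt((sum_sq - n * overall_mean**2) / (n - 1))
```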