
I have searched for an answer and the advice online seems very conflicting on this matter.

The logic of what I am doing is taking clumps of data from a database, performing calculations on the data in memory and writing the results to a csv file. I always thought pandas was the fastest way, but posts like this: What is the fastest and most efficient way to append rows to a DataFrame? possibly suggest otherwise.

So in essence, I pull 5-10 datasets (all have a date column) of 1000-10000 rows and put them in dataframes. I then create new columns that are calculations from existing columns. Up to this point I would say dataframes are the way to go because of those calculations (moving averages, etc.). At this point, ALL dataframes have the same number of rows and the same start dates.
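For illustration, the kind of column calculations I mean look roughly like this (the column names and data here are made up):

import pandas as pd
import numpy as np

# Hypothetical stand-in for one dataset pulled from the database
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=1000, freq="D"),
    "Close": np.random.rand(1000) * 100,
})

# New columns calculated from existing ones, e.g. a 20-row moving average
df["MA20"] = df["Close"].rolling(window=20).mean()
df["Change"] = df["Close"].diff()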

I then go through each dataset row by row, checking that the row in each dataframe is for the same date, and apply a condition; if that condition is true, I do a calculation using data from that row in each dataframe and save the result.

Once the calculations are complete, I want to write a summary to a csv file.

The next step is to do all of this in a loop, saving the summary each time and comparing summaries, so it could take hours of processing time; any optimization I can do will help (yes, I will thread it).
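To give a rough idea of what I mean by the outer loop and threading, something like this (the function and parameter values are made up for illustration):

from concurrent.futures import ThreadPoolExecutor

def run_one_pass(window):
    # Hypothetical worker: in the real code this would pull the datasets,
    # build the calculated columns, do the row-wise checks and return a
    # summary dict for this parameter value.
    return {"window": window, "result": window * 2}

windows = [10, 20, 50, 100]  # made-up parameter values to compare
with ThreadPoolExecutor() as pool:
    summaries = list(pool.map(run_one_pass, windows))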

The part that currently takes the longest is going through the dataframe row by row; it takes even longer than doing all the calculations on all the dataframes.

The code I use to cycle through each row is:

for i in df.index:
    # check that the row in each dataframe is for the same date
    if df.loc[i, 'Date'] != df2.loc[i, 'Date']:  # etc. for the other dataframes
        break  # throw critical error
    # calculations using df.loc[i, 'data column'] to pull the data from the dataframe
    result = df.loc[i, 'data column'] - df2.loc[i, 'data column']

I have read posts on both sides: some say this is the fastest, others say itertuples is faster.
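For reference, the itertuples version of the same loop would look roughly like this (assuming df and df2 are aligned dataframes as above, with a 'Close' column standing in for my real data columns):

results = []
for row1, row2 in zip(df.itertuples(index=False), df2.itertuples(index=False)):
    # attribute access (row1.Date) replaces the .loc[i, 'Date'] lookups
    if row1.Date != row2.Date:
        raise ValueError("Date mismatch between dataframes")
    results.append(row1.Close - row2.Close)  # placeholder for the real calculation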

Once the calculations are complete I will throw the results into a dictionary/list/dataframe (which is fastest?) and then write to csv.
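To be concrete about that last step, this is the sort of thing I had in mind (the summary fields are invented):

import pandas as pd

# Hypothetical summary rows collected while looping
summary_rows = [
    {"run": 1, "mean_diff": 0.42, "hits": 17},
    {"run": 2, "mean_diff": 0.38, "hits": 21},
]

# Build the frame once at the end and write everything in one go
pd.DataFrame(summary_rows).to_csv("summary.csv", index=False)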

Hopefully I have explained myself.

James
  • Pandas isn't really designed to loop through all the rows. You will probably want to do an apply function that does the comparison for the dates – user1558604 Feb 20 '20 at 17:51
  • It's difficult to suggest anything, as your use case is very specific. In general, reduce the number of times you make disk IO operations, keep all data in memory and perform transforms/compares on the dataframe, avoiding row by row operations. – monkut Feb 20 '20 at 17:53
  • Your title says database...is this a sqlite database? Could you merge the datasets first? – user1558604 Feb 20 '20 at 17:56
  • Merging them from a database I do not believe would work, as the calculations like moving averages are group specific. – James Feb 20 '20 at 17:58
  • The comparison for the dates could be done as a whole on the dataframes; however, I do it there because I have to go through each row anyway to apply various conditions on each row. – James Feb 20 '20 at 18:02
