I have searched for an answer and the advice online seems very conflicting on this matter.
The logic of what I am doing is taking clumps of data from a database, performing calculations on the data in memory and writing the results to a csv file. I always thought pandas was the fastest way, but posts like this: What is the fastest and most efficient way to append rows to a DataFrame? possibly suggest otherwise.
So in essence, I pull 5-10 datasets (all have a date column) of 1000-10000 rows each and put them in dataframes. I then create new columns that are calculations from existing columns. Up to this point I would say dataframes are the way to go because of those calculations (moving averages etc.). At this point, ALL dataframes have the same number of rows and the same start dates.
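For what it's worth, the column calculations themselves can stay fully vectorized. Here is a minimal sketch of that step; the 'Close' column, the window size and the random data are placeholders I made up for illustration, not names from my actual datasets:

import pandas as pd
import numpy as np

# hypothetical data standing in for one of the datasets pulled from the database
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=1000, freq='D'),
    'Close': np.random.default_rng(0).normal(100, 5, 1000),
})

# vectorized calculated columns: a 20-row moving average and a day-over-day change
df['ma20'] = df['Close'].rolling(window=20).mean()
df['change'] = df['Close'].diff()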
I then go through each dataset row by row, check that the row in every dataframe has the same date, and evaluate a condition; if that condition is true, I do a calculation using data from that row in each dataframe and save the result.
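Since the dataframes share the same row count and start dates, I suspect the date check and the conditional calculation could both be done column-wise instead of row by row. A sketch of that idea, assuming two already-built dataframes df and df2; the condition and the calculation are placeholders, and 'data column' stands in for my real column names:

# verify the two dataframes are aligned on 'Date' in one shot
if not df['Date'].equals(df2['Date']):
    raise ValueError('Date columns are not aligned')  # critical error

# apply the calculation only where a condition holds, without an explicit loop
mask = df['data column'] > df2['data column']                   # placeholder condition
result = df.loc[mask, 'data column'] - df2.loc[mask, 'data column']  # placeholder calculation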
Once the calculations are complete, I want to write a summary to a csv file.
The next step is to do this in a loop, save each summary and compare the summaries, so this could take hours of processing time; any optimization I can do will help (yes, I will thread it).
The part that currently takes the longest is going through the dataframes row by row. It takes even longer than doing all the calculations on all the dataframes.
The code I use to cycle through each row is:
for i in range(len(df)):
    # make sure every dataframe is on the same date at this row position
    if df.iloc[i]['Date'] != df2.iloc[i]['Date']:
        break  # throw critical error
    # calculations using df.iloc[i]['data column'] to pull the data from each dataframe
    result = df.iloc[i]['data column'] * df2.iloc[i]['data column']  # placeholder calculation
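For comparison, here is the same loop written with itertuples, which avoids the per-row .iloc lookups; whether it is actually faster for my workload is something I would have to time, and the column names are the same placeholders as above:

# assuming a placeholder column 'value'; columns whose names are not valid
# Python identifiers (e.g. 'data column') come through itertuples as _1, _2, ...
for row1, row2 in zip(df.itertuples(index=False), df2.itertuples(index=False)):
    if row1.Date != row2.Date:
        break  # throw critical error
    result = row1.value * row2.value  # placeholder calculation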
I have read posts on both sides: some say this is fastest, others say itertuples is faster.
Once the calculations are complete, I will throw the results into a dictionary/list/dataframe (which is fastest?) and then write them to csv.
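As a sketch of that last step, under the assumption that each pass produces a small summary: appending plain dicts to a list and building one DataFrame at the very end (rather than appending to a DataFrame row by row) seems to be the cheaper pattern, followed by a single to_csv call; the run loop and the summary fields below are made up:

import pandas as pd

summaries = []  # accumulate plain dicts; build the DataFrame once at the end
for run_id in range(3):  # stand-in for the outer loop over parameter sets
    summaries.append({'run': run_id, 'metric': run_id * 1.5})  # placeholder summary values

pd.DataFrame(summaries).to_csv('summary.csv', index=False)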
Hopefully I have explained myself.