I'm hearing different views on when to use Pandas versus when to use SQL.
I tried to do the following in Pandas on 19,150,869 rows of data:
    for idx, row in df.iterrows():
        tmp = int(int(row['M']) / PeriodGranularity) + 1
        # note: row is a copy, so writing row['TimeSlot'] = ... is lost;
        # the value has to be written back to the DataFrame itself
        df.at[idx, 'TimeSlot'] = str(row['D'] + 1) + "-" + str(row['H']) + "-" + str(tmp)
And found it was taking so long I had to abort after 20 minutes.
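For comparison, here is a sketch of the same computation done with vectorized column operations instead of a Python-level row loop, which is the usual way to make this fast in Pandas. The small sample DataFrame and the PeriodGranularity value of 15 are assumptions standing in for the real data:

```python
import pandas as pd

# Assumed setup mirroring the question: D, H, M as integer-like columns,
# PeriodGranularity of 15 minutes (hypothetical values for illustration).
PeriodGranularity = 15
df = pd.DataFrame({"D": [0, 3], "H": [12, 23], "M": [7, 44]})

# Vectorized equivalent of the iterrows loop: each expression operates
# on whole columns at once, with no per-row Python overhead.
tmp = (df["M"].astype(int) // PeriodGranularity) + 1
df["TimeSlot"] = (
    (df["D"] + 1).astype(str) + "-" + df["H"].astype(str) + "-" + tmp.astype(str)
)
```

On tens of millions of rows, this formulation typically runs in seconds rather than the minutes an iterrows loop takes, because the work happens in compiled code rather than the Python interpreter.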
I performed the following in SQLite:
Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M,cast(strftime('%M',PlayedTimestamp) / 15+1 as int) as TimeSlot from tblMain
and found it completed in about 2.5 seconds ("19150869 rows returned in 2445ms").
Note: For the Pandas code I ran this in the step before it to get the data from the db:
sqlStr = "Select strftime('%w',PlayedTimestamp)+1 as D,strftime('%H',PlayedTimestamp) as H,strftime('%M',PlayedTimestamp) as M from tblMain"
df = pd.read_sql_query(sqlStr, con)
Is it my coding that's at fault here, or is it generally accepted that SQL is much faster for certain tasks?