
I'm hearing different views on when to use Pandas versus when to use SQL.

I tried to do the following in Pandas on 19,150,869 rows of data:

# per-row loop: build TimeSlot as "D-H-slot" for every row
# (note: iterrows yields copies, so assigning to `row` does not write back to df)
for idx, row in df.iterrows():
    tmp = int(int(row['M']) / PeriodGranularity) + 1
    row['TimeSlot'] = str(row["D"] + 1) + "-" + str(row["H"]) + "-" + str(tmp)

And found it was taking so long I had to abort after 20 minutes.

I performed the following in SQLite:

Select strftime('%w', PlayedTimestamp) + 1 as D,
       strftime('%H', PlayedTimestamp) as H,
       strftime('%M', PlayedTimestamp) as M,
       cast(strftime('%M', PlayedTimestamp) / 15 + 1 as int) as TimeSlot
from tblMain

and found it took a few seconds ("19150869 rows returned in 2445ms").

Note: to get the data from the db for the Pandas code, I ran this beforehand:

# con is an existing SQLite connection (e.g. from sqlite3.connect)
sqlStr = ("Select strftime('%w',PlayedTimestamp)+1 as D,"
          "strftime('%H',PlayedTimestamp) as H,"
          "strftime('%M',PlayedTimestamp) as M from tblMain")
df = pd.read_sql_query(sqlStr, con)

Is it my coding that's at fault here, or is it generally accepted that for certain tasks SQL is a lot faster?

Reddspark

2 Answers


It seems you can use a vectorized solution (PeriodGranularity is some variable):

df['TimeSlot'] = ((df["D"] + 1).astype(str) + "-" +
                  df["H"].astype(str) + "-" +
                  ((df['M'].astype(int) / PeriodGranularity).astype(int) + 1).astype(str))

And to convert datetimes to strings, use strftime.
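
For example, a minimal sketch of that idea, assuming PlayedTimestamp is pulled into the frame and is parseable by pd.to_datetime (the column names follow the question; the rest is an illustration, not code from the original post):

import pandas as pd

# hypothetical sketch: derive D/H/M in pandas instead of in the SQL query,
# assuming df has a PlayedTimestamp column that pd.to_datetime can parse
df['PlayedTimestamp'] = pd.to_datetime(df['PlayedTimestamp'])
df['D'] = df['PlayedTimestamp'].dt.strftime('%w').astype(int) + 1
df['H'] = df['PlayedTimestamp'].dt.strftime('%H')
df['M'] = df['PlayedTimestamp'].dt.strftime('%M')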

DataFrame.iterrows is really slow - check this.

First, some comparison of code for users coming from an SQL background.
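
As a minimal illustration of the kind of side-by-side the pandas "comparison with SQL" docs give (the table and column names here are made up for the example):

import pandas as pd

df = pd.DataFrame({'genre': ['rock', 'jazz', 'rock'],
                   'plays': [3, 5, 2]})

# SQL: SELECT genre, SUM(plays) AS plays FROM tblMain GROUP BY genre
out = df.groupby('genre', as_index=False)['plays'].sum()

# SQL: SELECT * FROM tblMain WHERE plays > 2
out2 = df[df['plays'] > 2]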

Comparing two technologies is really hard, and I am not sure a nice answer is possible on SO (the reasons are too broad), but I found this.

jezrael
  • Thanks for the quick responses, appreciated. Giving it a whirl. PS: I think you meant to write .astype(str) at the end there? – Reddspark Jun 22 '17 at 09:31
  • I added it because of `str(tmp)` – jezrael Jun 22 '17 at 09:32
  • 2
    Ok the results are in, the above method took 90 seconds. Definitely an improvement but not as fast as 3 seconds for SQLLite. I also checked my memory usage during the time and it hovered around 65%. Anyway was an interesting little exercise to help me make form a view. Happy to benchmark other alternatives if people have it, but generally I'm leaning towards SQLLite for this task. – Reddspark Jun 22 '17 at 09:53
  • Hmmm, it is also a hard question whether a `database` like `SQLite` or `pandas` is better/faster. In my opinion some tasks can be better/easier/faster in [sql](https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html), but some are more simply implemented in pandas (e.g. `pivot`). But I have no SQL background, so a better comparison is hard for me. – jezrael Jun 22 '17 at 10:50
  • I see this is the accepted answer, even though it doesn't answer the question at all. I wanted to find out when it would be beneficial to choose SQL over Pandas, or vice versa. – wobbily_col Aug 11 '17 at 14:34
  • @wobbily_col - it is not accepted, there is no check mark. I will try to improve it. – jezrael Aug 11 '17 at 14:36
  • @wobbily_col - I edited the answer, please check it. Not sure if something more complex is needed... – jezrael Aug 11 '17 at 14:41
  • I am looking for more users' experience than a syntax comparison. I saw a couple of posts saying that Pandas was fine until the data set wouldn't fit in memory, for example (with SQL it's usually fine while the index fits in memory, so I could see that being a plus for SQL, but more detail would be preferred). String handling is way nicer in Python than in SQL. – wobbily_col Aug 11 '17 at 14:45
  • @wobbily_col - a very hard question for me, especially as I have no SQL background, so it's not easy for me to answer :( But in my opinion pandas is good if you need data processing with smallish data (check the Highlights at the [bottom of the page](http://pandas.pydata.org/index.html)). And maybe [this](https://www.quora.com/What-is-the-difference-between-pandas-and-SQL) also helps. – jezrael Aug 11 '17 at 14:55

Pandas DataFrames are not meant to be iterated over this way. This is an excellent source for better understanding what should be used for efficiency. In short, numpy vectorization is the way to go whenever possible; otherwise pandas' apply() function is still many times faster than iterrows(). It would be interesting to see the comparison results after refactoring.
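
For instance, a rough sketch of what that refactor could look like, with made-up data and the question's PeriodGranularity assumed to be 15; actual timings will vary by machine:

import numpy as np
import pandas as pd

PeriodGranularity = 15  # slot size in minutes, as in the question's SQL

# synthetic stand-in for the question's data
df = pd.DataFrame({'D': np.random.randint(0, 7, 1_000_000),
                   'H': np.random.randint(0, 24, 1_000_000).astype(str),
                   'M': np.random.randint(0, 60, 1_000_000).astype(str)})

# row-wise apply(): still per-row Python, but much faster than iterrows()
df['TimeSlot'] = df.apply(
    lambda r: str(r['D'] + 1) + '-' + r['H'] + '-'
              + str(int(r['M']) // PeriodGranularity + 1),
    axis=1)

# vectorized: whole-column operations, typically the fastest option
df['TimeSlot'] = ((df['D'] + 1).astype(str) + '-' + df['H'] + '-'
                  + (df['M'].astype(int) // PeriodGranularity + 1).astype(str))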