4

I have 2 dataframes. My main dataframe is `dffinal`:

        date  id  och  och1  och2  och3  cch1  LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN

My second dataframe is `df2`:

        date  och  cch  och1  och2  och3  cch1
0  5/30/2012 -0.7 -0.7     3    -1     1    56
1  9/16/2013  0.9 -1.0     6     4     3     7
2  9/26/2013  2.5  5.4     2     3     2     4
3  8/26/2016  0.1 -0.7     4     3     5    10

I have this loop:

for i in dffinal.index:    
    df3=df2.copy()
    
    df3 = df3[df3['och1'] >dffinal['och1'].iloc[i]]
    df3 = df3[df3['och2'] >dffinal['och2'].iloc[i]]
    df3 = df3[df3['och3'] >dffinal['och3'].iloc[i]]    
    
    df3 = df3[df3['cch1'] >dffinal['cch1'].iloc[i]]     
    
    dffinal['LCH'][i] =df3["och"].mean()
    dffinal['L#'][i] =len(df3.index)

As is clear from my code, the values of `LCH` and `L#` are obtained from `df2` (via `df3`) based on the conditions above.

This code works very well, but it is very slow. I found out that I can improve efficiency with pandas vectorization, but I could not figure out how to apply it to my case.

This is my desired result:

        date  id  och  och1  och2  och3  cch1       LCH   L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000  1.0
1   4/9/2020   2  2.0     1     2     1     3  1.166667  3.0

I would greatly appreciate it if you could help me increase the efficiency of my code.

Correct answer

I personally use @shadowtalker's easy method, simply because I can understand how it works.

The most efficient answer is fast but complex.

Bogdan Titomir
  • It helps a lot if you can post the data in CSV or JSON format, so that people can easily load it and test out their answers. Fixed width is less ideal. – shadowtalker Jun 30 '21 at 15:17
  • Also - how is `diffinal` defined? – shadowtalker Jun 30 '21 at 15:21
  • @shadowtalker sorry, I was trying to do the question according to this https://stackoverflow.com/a/20159305/15542251 guide. Not sure if I understand you correctly. `diffinal` is simply my first dataframe – Bogdan Titomir Jun 30 '21 at 15:24
  • See [include a minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) for how to include a data frame with your code. Make it easy for others to help you. – Prune Jun 30 '21 at 15:28
  • @Prune, thanks, next time I will use this as my main guide for posting questions – Bogdan Titomir Jun 30 '21 at 15:32
  • That is only one support item. Please continue to use [on topic](https://stackoverflow.com/help/on-topic) and [how to ask](https://stackoverflow.com/help/how-to-ask) from the [intro tour](https://stackoverflow.com/tour). – Prune Jun 30 '21 at 15:41

4 Answers

3

It may be very difficult to avoid iteration entirely with the logic you have in place to select a subset of rows in df2 for a given dffinal row, but you should be able to speed up the iterative method (hopefully by a lot) with the approach below.

(Note: if you're repeatedly accessing the row of the dataframe you're iterating through, use .iterrows so you can grab the row's values much more simply and quickly.)

for i, row in dffinal.iterrows():
    och_array = df2.loc[(df2['och1'] > row['och1']) &
                        (df2['och2'] > row['och2']) &
                        (df2['och3'] > row['och3']) &
                        (df2['cch1'] > row['cch1']), 'och'].values
    dffinal.at[i, 'LCH'] = och_array.mean()
    dffinal.at[i, 'L#'] = len(och_array)

This avoids repeated lookups in dffinal and avoids creating a new copy of the dataframe several times per iteration. I can't test this without a data sample, but I think this will work.
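
To test it without a data sample, here is a minimal sketch that rebuilds the two example frames from the tables in the question (values copied from there), so the loop above can be run as-is:

import numpy as np
import pandas as pd

# Sample frames reconstructed from the question, only for testing the loop above
dffinal = pd.DataFrame({
    'date': ['3/27/2020', '4/9/2020'],
    'id':   [1, 2],
    'och':  [-2.1, 2.0],
    'och1': [3, 1],
    'och2': [3, 2],
    'och3': [1, 1],
    'cch1': [5, 3],
    'LCH':  np.nan,
    'L#':   np.nan,
})

df2 = pd.DataFrame({
    'date': ['5/30/2012', '9/16/2013', '9/26/2013', '8/26/2016'],
    'och':  [-0.7, 0.9, 2.5, 0.1],
    'cch':  [-0.7, -1.0, 5.4, -0.7],
    'och1': [3, 6, 2, 4],
    'och2': [-1, 4, 3, 3],
    'och3': [1, 3, 2, 5],
    'cch1': [56, 7, 4, 10],
})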

Clay Shwery
  • Thank you, apparently it is not possible to use vectorization in my case. I tried your code, but I get this error "None of [Index(['LCH'], dtype='object')] are in the [columns]". `for i,row in dffinal.iterrows(): df_stats = df2.loc[(df2['och1'] >row['och1']) & (df2['och2'] >row['och2']) & (df2['och3'] >row['och3']) & (df2['cch1'] >row['cch1']),['LCH']].mean() dffinal.at[i,'LCH'] = df_stats['LCH']` I used this code – Bogdan Titomir Jun 30 '21 at 15:13
  • Note that `itertuples` should be even faster than `iterrows`, and might be more "dtype-safe" as well. – shadowtalker Jun 30 '21 at 15:14
  • @shadowtalker could you please show me an example, I have never tried `itertuples` before – Bogdan Titomir Jun 30 '21 at 15:15
  • @BogdanTitomir I edited the code, I didn't read carefully enough how you calculated LCH and L#, I think it should work now – Clay Shwery Jun 30 '21 at 15:20
  • Thank you, your code works perfectly and it significantly improved the performance of my code. I chose @shadowtalker's answer as correct simply because it is slightly faster. Unfortunately, I have to choose only one answer as correct – Bogdan Titomir Jun 30 '21 at 16:23
3

This answer is based on https://stackoverflow.com/a/68197271/2954547, except that it uses `itertuples` instead of `iterrows`. `itertuples` is generally safer than `iterrows`, because it preserves dtypes correctly. See the "Notes" section of the `DataFrame.iterrows` documentation.

It also is self-contained, in that it can be executed top-to-bottom without having to copy/paste data, etc.

Note that I iterate over `df1.itertuples` and not `df_final.itertuples`. Never mutate something that you are iterating over, and never iterate over something that you are mutating. Modifying a DataFrame in-place is a form of mutation.

import io

import pandas as pd


data1_txt = """
     date  id  och  och1  och2  och3  cch1  LCH  L#
3/27/2020   1 -2.1     3     3     1     5  NaN NaN
4/9/2020   2  2.0     1     2     1     3  NaN NaN
"""

data2_txt = """
     date  och  cch  och1  och2  och3  cch1
5/30/2012 -0.7 -0.7     3    -1     1    56
9/16/2013  0.9 -1.0     6     4     3     7
9/26/2013  2.5  5.4     2     3     2     4
8/26/2016  0.1 -0.7     4     3     5    10
"""

df1 = pd.read_fwf(io.StringIO(data1_txt), index_col='id')
df2 = pd.read_fwf(io.StringIO(data2_txt))

df_final = df1.copy()

for row in df1.itertuples():
    row_mask = (
        (df2['och1'] > row.och1) &
        (df2['och2'] > row.och2) &
        (df2['och3'] > row.och3) &
        (df2['cch1'] > row.cch1)
    )
    och_vals = df2.loc[row_mask, 'och']
    i = row.Index
    df_final.at[i, 'LCH'] = och_vals.mean()
    df_final.at[i, 'L#'] = len(och_vals)

print(df_final)

The output is

         date  och  och1  och2  och3  cch1  LCH  L#       LCH   L#
id                                                                
1   3/27/2020 -2.1     3     3     1     5  NaN NaN  0.900000  1.0
2    4/9/2020  2.0     1     2     1     3  NaN NaN  1.166667  3.0
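
As an aside, here is a minimal sketch (a hypothetical two-column frame, not the question's data) of the dtype point above: `iterrows` packs each row into a single Series, so mixed columns get upcast, while `itertuples` keeps each column's own type.

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})  # int column + float column

_, row = next(df.iterrows())   # each row becomes a Series with one shared dtype
print(row['a'])                # 1.0 -- the int value was upcast to float64

tup = next(df.itertuples())    # namedtuple, per-column dtypes preserved
print(tup.a)                   # 1 -- still an integer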
shadowtalker
  • I haven't really spent much time answering questions here until this week, but will definitely steal that StringIO technique for getting print statement output from questions into my notebooks. Also thanks for showing this example. My suspicion regarding optimization is that iterrows vs itertuples will not make a big difference because the vast majority of the computation is in selecting the correct rows from df2, but it's cool to see this implementation! – Clay Shwery Jun 30 '21 at 15:35
  • Good point. The main benefit of `itertuples` is not speed but dtype-safety. See the "Notes" section of https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html – shadowtalker Jun 30 '21 at 15:36
  • How can I add a new line `df_final.at[i, 'SumPosNeg']=` which needs to be equal to sum of all positive `och` values divided by the sum of all negative `och` values or `sum(och>0)/sum(och<0)` – Bogdan Titomir Jun 30 '21 at 16:58
  • I.e. the new column results should be equal to 2.0/(-2.1)=-0.95. I got 2.0 and 2.1 from the final results dataframe. – Bogdan Titomir Jun 30 '21 at 17:01
  • @BogdanTitomir it might be worthwhile to spend some time understanding mine and Clay's answers, so that you can add your own extensions or modifications as needed. – shadowtalker Jun 30 '21 at 19:28
  • @shadowtalker, thanks, I have figured it out. Your code was the easiest to understand, and I decided to go with your solution – Bogdan Titomir Jun 30 '21 at 20:16
3

The only way I can think of to do this with pandas methods, without loops, is a cross join after resetting the index, then comparing the columns and reducing with df.all(1):

cols = ['och1', 'och2', 'och3', 'cch1']
u = df2.reset_index().assign(k=1).merge(
    dffinal.reset_index().assign(k=1), on='k', suffixes=('', '_y'))
# for newer versions of pandas there is a how='cross' option included now

dffinal['NewLCH'] = (u[u[cols].gt(u[[f"{i}_y" for i in cols]].to_numpy()).all(1)]
                     .groupby("index_y")['och'].mean())

print(dffinal)

        date  id  och  och1  och2  och3  cch1  LCH  L#    NewLCH
0  3/27/2020   1 -2.1     3     3     1     5  NaN NaN  0.900000
1   4/9/2020   2  2.0     1     2     1     3  NaN NaN  1.166667
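
For pandas 1.2+ the same cross join can be written without the dummy `k` column by using `how='cross'` (a sketch of the variant mentioned in the code comment above; the rest of the logic is unchanged and the memory footprint is the same):

cols = ['och1', 'och2', 'och3', 'cch1']

# cross join: every df2 row paired with every dffinal row (suffix '_y' marks dffinal's columns)
u = df2.reset_index().merge(
    dffinal.reset_index(), how='cross', suffixes=('', '_y'))

# keep pairs where df2 beats dffinal on all four columns, then average 'och' per dffinal row
dffinal['NewLCH'] = (u[u[cols].gt(u[[f"{i}_y" for i in cols]].to_numpy()).all(1)]
                     .groupby("index_y")['och'].mean())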
anky
  • I tried your code, but I get an error that says "Unable to allocate 39.9 GiB for an array with shape (5358055840,) and data type int64", which I understand means this code requires 40 GiB of RAM. Or did I do something wrong? – Bogdan Titomir Jun 30 '21 at 15:53
  • @BogdanTitomir yes, I would not recommend using a cross join for big dataframes (it takes a lot of space); maybe select only the relevant columns (cols and och) before the reset_index and try – anky Jun 30 '21 at 15:54
  • I think this method is too complicated for my brain:) but anyway thank you! – Bogdan Titomir Jun 30 '21 at 15:57
2

Here is one way to approach your problem:

def fast(A, B):
    # A: rows of df1, B: rows of df2; column 0 is 'och', columns 1+ are the thresholds
    for a in A:
        m = (B[:, 1:] > a[1:]).all(1)  # df2 rows exceeding all thresholds of this df1 row
        yield B[m, 0].mean(), m.sum()  # mean 'och' of those rows, and their count

c = ['och', 'och1', 'och2', 'och3', 'cch1']
df1[['LCH', 'L#']] = list(fast(df1[c].to_numpy(), df2[c].to_numpy()))

        date  id  och  och1  och2  och3  cch1       LCH  L#
0  3/27/2020   1 -2.1     3     3     1     5  0.900000   1
1   4/9/2020   2  2.0     1     2     1     3  1.166667   3
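
For intuition, here is a tiny sketch of the broadcasting comparison that fast performs for a single row, using the first two df2 rows and the first dffinal row from the question (column order och, och1, och2, och3, cch1):

import numpy as np

B = np.array([[-0.7, 3, -1, 1, 56],   # first two df2 rows
              [ 0.9, 6,  4, 3,  7]])
a = np.array([-2.1, 3, 3, 1, 5])      # first dffinal row, same column order

m = (B[:, 1:] > a[1:]).all(1)         # does each df2 row exceed all four thresholds?
print(m)                              # [False  True]
print(B[m, 0].mean(), m.sum())        # 0.9 1  ->  LCH and L# for that dffinal row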
Shubham Sharma
  • Thank you very much, the code gets the job done and is much faster than my original code, but surprisingly a little slower than the other answers. The code looked very complex and I expected it to be faster than the other answers. But maybe I did something wrong and corrupted the code:) – Bogdan Titomir Jun 30 '21 at 17:28
  • @BogdanTitomir Not sure about the problem but in my tests I found this to be `5-6x` faster. By the way what are the shapes of dataframe `df1` and `df2`? – Shubham Sharma Jun 30 '21 at 17:31
  • Test sample `df1` has 12,000 rows. `df2` has 450,000 – Bogdan Titomir Jun 30 '21 at 17:42
  • Thank you, it turns out to be the fastest method. I chose this as the correct answer, but probably I will go with other methods simply because this method feels too complicated to me. – Bogdan Titomir Jun 30 '21 at 18:19
  • Glad it worked for you. A few more optimizations are possible, but they would only make the code more complex. – Shubham Sharma Jun 30 '21 at 18:23
  • Last question. If I had another column in `df2`, let's call it `new`, how should I edit the code to get `df1[['LCH', 'Lnew', 'L#']]`, where `Lnew` is the mean of column `new`? – Bogdan Titomir Jun 30 '21 at 18:29
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234384/discussion-between-shubham-sharma-and-bogdan-titomir). – Shubham Sharma Jun 30 '21 at 18:32