2

I try to update a pandas dataframe which has 3 million rows. At the below, I reduced my problem into a more simple problem. In short, it does add values in a cummulative sense.

But, this function takes too long time for me like more than 10 hours in a real problem. Is there any room for speeds up? Should I update it only at the last?

Can we update the pandas dataframe with a more faster way than a iterrows()?

Can we select multiple rows by their index and then updates?

def set_r(group, i, colname, add):
    if colname in group:
        prev = group.iloc[i][colname]
        if  math.isnan(prev):
            group.set_value(i, colname, add)
        else:
            group.set_value(i, colname, prev+add)
    else:
        group.set_value(i, colname, add)

def set_bl_info(group, i, r, bl_value, timeframe, clorca, bl_criteria):

    group.set_value(i, timeframe + '_' + bl_criteria, True)
    colname = timeframe + '_' + clorca + '_' + 'bb_count_'+ bl_criteria
    set_r(group, i, colname, 1)

def bl_assign(days, bl_key, bl_value, group, bl_p05, bl_p01):
    print bl_key
    sub_group = group[(group.pledged_date >= bl_value[0]) & (group.pledged_date <= bl_value[1])]
    coexisting_icl = sub_group[(sub_group.project_category == bl_value[2]) & (sub_group.cluster == bl_value[3])]

    for i, r in coexisting_icl.iterrows():
        set_bl_info(group, i, r, bl_value, 'coexisting', 'icl','p1')


# main function
bl_assign(days, bl_key, bl_value, group, bl_p05, bl_p01)

For more simplicity, my problem is something like below:

    A   B    C
0   0   0   False
1   7   0   True
2   8   0   True
3   5   0   True

Update B column if C is true with sum of A column's elements

    A   B    C     
0   0   0   False
1   7   20   True
2   8   20   True
3   5   20   True

After then, if D is also true then update B with sum of E in cumulatively

    A   B    C      D    E 
0   0   0   False False  1
1   7   20   True False  1
2   8   20   True True   1
3   5   20   True True   1

    A   B    C      D    E 
0   0   0   False False  1
1   7   20   True False  1
2   8   22   True True   1
3   5   22   True True   1
JonghoKim
  • 1,965
  • 7
  • 21
  • 44
  • 3
    Have you tried doing it with http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.cumsum.html ? – Khris Sep 01 '16 at 12:36
  • Actually, the problem is a little bit different from a pure cumulative sum, I try to give the sum of values repeatedly over bl_value "s". Therefore it is kinds of a sum of sum with complex conditions. But your comments provoke me a another possibility of faster algorithm. I would borrow idea of it. – JonghoKim Sep 01 '16 at 12:42
  • 1
    Is using Cython possible in your case? – bartoszukm Sep 01 '16 at 13:16
  • I would apply Cython. This link may helpful for me http://stackoverflow.com/questions/30270117/how-to-apply-cython-to-pandas-dataframe – JonghoKim Sep 01 '16 at 13:18
  • @JonghoKim "After then, if D is also true then update B with sum of E in cumulatively" Can you explain how the sum at the end is cumulative? I don't see it. – Ami Tavory Sep 01 '16 at 13:27
  • Some of "B" has 22 which is from 20 + 2 (values are from "E" 1+1 who satisfy "true" in "D"), not just update "B" into 2 – JonghoKim Sep 01 '16 at 13:28
  • 1
    @JonghoKim Just an FYI - "cumulatively" usually is used in the sense of an accumulation by the rows, so a cumulative sum somehow would usually mean you're finding the partial sums of the rows. – Ami Tavory Sep 01 '16 at 13:36

1 Answers1

3

Update B column if C is true with sum of A column's elements

import numpy as np

df['B'] = np.where(df.C, df.A.sum(), 0)

After then, if D is also tru then update B with the sum of E (using the comment to the question above)

df.B = df.B + np.where(df.D, (df.E * df.D.astype(int)).sum(), 0)

So, at the end you have

>>> df
   A      C   B  E      D
0  0  False   0  1  False
1  7   True  20  1  False
2  8   True  22  1   True
3  5   True  22  1   True
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185