4

I have a pandas df with word counts corresponding to articles. I want to be able to be able to add another column MERGED that is based on groups of articles that have a minimum cumulative sum of 'min_words'.

df = pd.DataFrame([[  0,  6],
       [  1,  10],
       [  3,   5],
       [  4,   7],
       [  5,  26],
       [  6,   7],
       [  9,   4],
       [ 10, 133],
       [ 11,  42],
       [ 12,   1]], columns=['ARTICLE', 'WORD_COUNT'])

df
Out[15]: 
   ARTICLE  WORD_COUNT
0        0           6
1        1          10
2        3           5
3        4           7
4        5          26
5        6           7
6        9           4
7       10         133
8       11          42
9       12           1

So then if min_words = 20 this is the desired output:

    df
Out[17]: 
   ARTICLE  WORD_COUNT  MERGED
0        0           6       0
1        1          10       0
2        3           5       0
3        4           7       1
4        5          26       1
5        6           7       2
6        9           4       2
7       10         133       2
8       11          42       3
9       12           1       4

As seen above, it is possible that the final article(s) won't satisfy the min_words condition, and that's ok.

ginobimura
  • 115
  • 1
  • 5
  • Related https://stackoverflow.com/questions/56904390/restart-cumsum-and-get-index-if-cumsum-more-than-value for how to cumsum up to a value and find the indices where it should reset. – ALollz Jun 25 '20 at 23:56

1 Answers1

2

We can only do self def function

def dymcumsum(v, limit):
     idx = []
     sums = 0
     for i in range(len(v)):
         sums += v[i]
         if sums >= limit:
             idx.append(i)
             sums = 0
     return(idx)
df['New']=np.nan
df.loc[dymcumsum(df.WORD_COUNT,20),'New']=1
df.New=df.New.iloc[::-1].eq(1).cumsum()[::-1].factorize()[0]+1
 
df
   ARTICLE  WORD_COUNT  New
0        0           6    1
1        1          10    1
2        3           5    1
3        4           7    2
4        5          26    2
5        6           7    3
6        9           4    3
7       10         133    3
8       11          42    4
9       12           1    5
BENY
  • 317,841
  • 20
  • 164
  • 234