Creating Sequence Sum using Pandas and R

Question

Hi All i want to calculate sequence and Flag using ID,Month and Value. for every unique id if value changes to zero then sequence is 1 and if it continues to be zero for succesive months then sequence will add as shown above.

Flag will be 1 when the sequence add to 6.

Please help i want to do it using Pandas and R

Ciao! Welcome to SO. First of all you should read [here](http://stackoverflow.com/help/how-to-ask) about how to ask a good question; a good question has better likelihood to be solved and you to receive help. On the other hand a read of [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) is also good thing. It explains how to create a reproducible example in R. Help users to help you by providing a piece of your data, a desired output and things you have already tried. — SabDeM, Nov 05 '16 at 13:15

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

In R, we can create the 'Sequence' and 'Flag' using data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'Id', we specify the 'i' with Value == 0, create the 'Sequence' as the sequence (1:.N) based on the TRUE values in 'i'. Then, create the 'Flag' by assigning (:=) 1 for those elements that are 1 for the 'Sequence'. If there are 'Id' that have no 1 or all the 'Flag' values are NA, then concatenate 0 at the top while removing one element from 'Flag' (Flag[-1]) or else return the 'Flag'

library(data.table)
setDT(df1)[Value == 0, Sequence := 1:.N , by = Id]
df1[Sequence ==1, Flag := 1][, Flag := if(all(is.na(Flag))) 
                     c(0, Flag[-1]) else Flag, by = Id]
df1
#        Id  Month Value Sequence Flag
# 1: SCSR1 Jan-16   400       NA   NA
# 2: SCSR1 Feb-16     0        1    1
# 3: SCSR1 Mar-16     0        2   NA
# 4: SCSR1 Apr-16     0        3   NA
# 5: SCSR1 May-16     0        4   NA
# 6: SCSR1 Jun-16     0        5   NA
# 7: SCSR1 Jul-16     0        6   NA
# 8: SCCS9 Jan-16   440       NA    0
# 9: SCCS9 Feb-16  3000       NA   NA
#10: SCCS9 Mar-16   400       NA   NA
#11: SCCS9 Apr-16   100       NA   NA
#12: SCCS9 May-16   300       NA   NA
#13: SCCS9 Jun-16   400       NA   NA
#14: SCCS9 Jul-16   100       NA   NA
#15: SKHH1 Jan-16  1000       NA   NA
#16: SKHH1 Feb-16     0        1    1
#17: SKHH1 Mar-16     0        2   NA
#18: SKHH1 Apr-16     0        3   NA
#19: SKHH1 May-16     0        4   NA
#20: SKHH1 Jun-16     0        5   NA
#21: SKHH1 Jul-16     0        6   NA

NOTE: It is better to have NA as missing values than blanks ("") to preserve the class of the column.

data

df1 <- data.frame(Id = rep(c("SCSR1", "SCCS9", "SKHH1"), each = 7),
    Month = rep(c('Jan-16', 'Feb-16', 'Mar-16', 'Apr-16', 'May-16', 'Jun-16',
   'Jul-16'), 3), Value = c(400, rep(0, 6), 440, 3000, 400, 100, 300, 400,
   100, 1000, rep(0,6)))

Thanks Akrun it helped, I had 1 more query if you can help http://stackoverflow.com/questions/40453252/how-to-calculate-previous-record-with-refrence-to-2-tables-in-r — Atul Sahare, Nov 07 '16 at 07:47

score 0 · Answer 2 · answered Nov 05 '16 at 14:41

Pandas solution below.

Note that if values switch back to nonzero within an Id, the sequence column below will reflect that by replicating the last value (e.g., 1,2,3,4,5,6,6,6,6, ...). That can be cleaned up by shifting the Value column just as we do the Id column and comparing within an Id. However, the cumulative sum approach breaks down if Values can oscillate between 0 and nonzero within an Id. If that doesn't happen, this approach should be fine. I'm curious to see if there are other solutions robust to this problem.

import pandas as pd

df = pd.DataFrame({'Id':['SCSR1']*7+['SCCS9']*7+['SKHH1']*7, 'Value':[400]+[0]*6+[440,3000,400,100,300,400,100,1000]+[0]*6})
df['flag'] = 0

# create a shifted column to ensure that comparisons are made within an Id
df['Id2'] = df['Id'].shift()

# set the appropriate flag values to 1 and define the sequences
df.loc[(df['Id'] == df['Id2']) & (df['Value'] == 0), 'flag'] = 1
df['Sequence'] = df.groupby('Id')['flag'].cumsum()

@akrun already gave you a nifty R solution. Another way to do it uses the rleid function. But I like @akrun's solution better.

Creating Sequence Sum using Pandas and R

2 Answers2

data