
I am fairly new to the art of programming (loops, etc.), and I would be grateful for an opinion on whether my approach is fine or whether it would definitely need to be optimized if it were used on a much bigger sample.

Currently I have approximately 20,000 observations, and one of the columns is the receipt ID. What I would like to achieve is to assign each row to a group consisting of IDs that ascend in steps of +1 (n, n+1, n+2, ...). Whenever this rule is broken, a new group should be created, and so on every time the rule is broken again.

To illustrate, let's say I have this table (an important note is that the IDs are not necessarily unique and can repeat, like ID 10 in my example):

MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))

MyTable

    ID
    1
    2
    3
    4
    6
    7
    8
    10
    10
    11
    17
    18
    19
    200
    201
    202
    2010
    2011
    2013

The result of my grouping should be the following:

    ID    GROUP
    1      1
    2      1
    3      1
    4      1
    6      2
    7      2
    8      2
    10     3
    10     3
    11     3
    17     4
    18     4
    19     4
    200    5
    201    5
    202    5
    2010   6
    2011   6
    2013   7

I used dplyr to order the IDs ascending (sketched below), then created the variable MyTable$GROUP, which I simply filled with 1's:
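(The exact ordering call is not shown in the question; a minimal sketch of that step, assuming plain dplyr::arrange, would be:)

library(dplyr)

# Assumed reconstruction of the ordering step described above
MyTable <- arrange(MyTable, ID)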

MyTable$GROUP <- rep(1, length(MyTable$ID))

for (i in 2:length(MyTable$ID)) {
  if (MyTable$ID[i] == MyTable$ID[i-1] + 1 | MyTable$ID[i] == MyTable$ID[i-1]) {
    MyTable$GROUP[i] <- MyTable$GROUP[i-1]
  } else {
    MyTable$GROUP[i] <- MyTable$GROUP[i-1] + 1
  }
}
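As a quick sanity check (my addition, not part of the original post), the loop's result can be compared against the expected grouping above:

all(MyTable$GROUP == c(1,1,1,1, 2,2,2, 3,3,3, 4,4,4, 5,5,5, 6,6, 7))
# should return TRUE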

This code worked for me and I got the results fairly easily. However, I wonder whether, in the eyes of more experienced programmers, this piece of code would be considered "bad", "average", "good", or whatever rating you come up with.

EDIT: I am sure this topic has been touched on already; I am not arguing against that. The main difference is that I would like to focus on optimization here and see whether my approach meets standards.

Thanks!

VKosik
  • if it works it's good – Nate Nov 11 '16 at 12:53
  • using a for loop in R has always been avoided! But yeah, from a beginner's perspective it's good. Now try to improve the efficiency by thinking from the "apply" family's view – joel.wilson Nov 11 '16 at 12:54
  • *"based on not breaking +1 rule"* is called *"sequential IDs"* or *"consecutive IDs"* :) Also, don't tag this [tag:rstudio] unless it specifically has something to do with RStudio (the IDE), which in this case it doesn't. – smci Nov 11 '16 at 13:36

1 Answer


To make a long story short (note: this first version was written before the repeated-ID example was added to the question, so the output below shows 12 where the edited question has a second 10; see the update further down for repeats):

MyTable$Group <- cumsum(c(1, diff(MyTable$ID) != 1))
#     ID Group
#1     1     1
#2     2     1
#3     3     1
#4     4     1
#5     6     2
#6     7     2
#7     8     2
#8    10     3
#9    11     3
#10   12     3
#11   17     4
#12   18     4
#13   19     4
#14  200     5
#15  201     5
#16  202     5
#17 2010     6
#18 2011     6
#19 2013     7

You are looking for all differences in your vector MyTable$ID that are not 1; those are your "breaks". Then you take the cumulative sum of these values. If you do not know cumsum, type ?cumsum. That's all!
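For illustration (a toy example of my own, not from the original answer), here are the intermediate values on a small vector without repeats:

x <- c(1, 2, 3, 6, 7, 10)
diff(x)                     # 1 1 3 1 3
diff(x) != 1                # FALSE FALSE TRUE FALSE TRUE
c(1, diff(x) != 1)          # 1 0 0 1 0 1  -> 1 marks each group start
cumsum(c(1, diff(x) != 1))  # 1 1 1 2 2 3  -> the group labels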

UPDATE: with repeating IDs, you can use this (a difference of 0 or 1 now counts as "same group"):

MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable$Group <- cumsum(c(1, !diff(MyTable$ID) %in% c(0,1) ))

#     ID Group
#1     1     1
#2     2     1
#3     3     1
#4     4     1
#5     6     2
#6     7     2
#7     8     2
#8    10     3
#9    10     3
#10   11     3
#11   17     4
#12   18     4
#13   19     4
#14  200     5
#15  201     5
#16  202     5
#17 2010     6
#18 2011     6
#19 2013     7
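To address the optimization angle of the question directly, here is a rough timing sketch of mine (not part of the original answer), using only base R, on a vector of about the same size as the ~20,000 observations mentioned:

set.seed(1)
ids <- sort(sample(1:1e5, 2e4))  # 20,000 ascending IDs, no repeats

loop_groups <- function(id) {
  # the question's loop, wrapped in a function
  g <- rep(1, length(id))
  for (i in 2:length(id)) {
    if (id[i] == id[i - 1] + 1 || id[i] == id[i - 1]) {
      g[i] <- g[i - 1]
    } else {
      g[i] <- g[i - 1] + 1
    }
  }
  g
}

system.time(loop_groups(ids))
system.time(cumsum(c(1, !diff(ids) %in% c(0, 1))))
# The cumsum/diff version runs in compiled code and is typically
# orders of magnitude faster than the interpreted loop.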
J_F
    I don't think you need to bring his mother into this :) How about you edit your question to include an example of these repeated IDs. It is impossible for people to anticipate these little issues if you don't include them – Nate Nov 11 '16 at 13:46
  • As Nathan wrote ... how should we know this? So make a GOOD reproducible example! – J_F Nov 11 '16 at 13:48
  • I have updated my answer ... – J_F Nov 11 '16 at 14:27