
I am looking for a way to add a column (almost like a sequence column) to a data set that indicates every change in a specific column. I found a very good solution here: Increment by 1 for every change in column in R, and it worked perfectly for most of the observations.

My data set has 18 columns and about 320'000 rows. Simplified, it looks like the following (desired result included as the sequence column):

df <- data.frame(var1= c(1, 0, 1, 0, 0, 1, 0, 0, 0), sequence=c(1, 2, 3, 4, 4, 5, 6, 6, 6))

I used the following piece of code and it worked well for my example above:

df$seq <- cumsum(c(1,as.numeric(diff(df$var1))!=0))
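
For reference, this is what each step of that expression computes on the example data (results shown as comments):

diff(df$var1)                                 # -1  1 -1  0  1 -1  0  0
as.numeric(diff(df$var1)) != 0                # TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
c(1, as.numeric(diff(df$var1)) != 0)          # 1 1 1 1 0 1 1 0 0
cumsum(c(1, as.numeric(diff(df$var1)) != 0))  # 1 2 3 4 4 5 6 6 6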

However, I noticed that sometimes my new column (the seq column) changes its value even though the other column (var1) does not!

[Screenshot: the seq column increments between two rows even though var1 appears unchanged.]

Is there anything wrong with the `cumsum(c(1, as.numeric(diff(df$var1)) != 0))` command, or does it have to do with problems in my data?

As I am quite new to R, I would be grateful if somebody could help me with this.

  • Where does that happen in your example? It seems to be working fine – Sotos Nov 03 '17 at 12:44
  • @Sotos thank you for your reply. Well, in my original data there is var1 = 201893210853, and the seq column changes from 1874 to 1875 and then moves on to 1876 even though var1 stays at 201893210853. Maybe there is some problem with my data, because the code worked really well for the first half, but then somehow the seq column begins to increment by one without any change in var1... – Julius Nov 03 '17 at 13:23
  • You need to provide a reproducible example that captures that error in order for us to help you. Don't put anything in comments (I mean data). Just update your question – Sotos Nov 03 '17 at 13:24
  • That shouldn't happen. The code seems correct (you can also double-check the results using the `data.table::rleid` function; see the cross-check sketched after these comments). – Sotos Nov 03 '17 at 15:19
  • @Sotos You can't have an integer that large, I guess; try `201893210853L`. OP should use bit64::integer64, maybe, or just be more careful to ensure that numbers displayed as integers actually are integers, and not `201893210853 + 0:1/10` or something. – Frank Nov 03 '17 at 15:30
  • @Frank yup! That explains it. – Sotos Nov 03 '17 at 15:34
  • @Frank & @Sotos thank you for your answers! Could you maybe explain briefly what exactly you mean by "OP should use bit64::integer64"? – Julius Nov 03 '17 at 15:46
  • Even when you have a number after the decimal point, R will often hide it from display; e.g., `x = 1999995.5` displays as 1999996 by default. To see if this is why you're having this problem, you could look at `x[(x %% 1) != 0]` (demonstrated after these comments). A "careful" approach would avoid floating-point numbers for grouping values or rows, since they're unreliable... you could work with `x2 <- round(x)` instead of `x`, or look at the bit64 package (which supports larger integers than R does, though I haven't used it myself). Maybe see https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal – Frank Nov 03 '17 at 15:54
  • Now I've got the solution. Perfect, you helped me a lot! I converted the var1 column with `as.integer64(df$var1)` from the bit64 package and then applied the code mentioned above again: `cumsum(c(1, as.numeric(diff(df$var1)) != 0))` (sketched after these comments). The following link helped me with the integer conversion: [struggling with integers (maximum integer size)](https://stackoverflow.com/questions/14589354/struggling-with-integers-maximum-integer-size). Shall I delete my question, or is it worth leaving it here? – Julius Nov 03 '17 at 17:18
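
A minimal cross-check along the lines of the `data.table::rleid` suggestion above, assuming the example data frame and the seq column from the question:

library(data.table)

# rleid() assigns a run-length id that increments on every change in var1,
# so on clean data it agrees with the cumsum() result:
df$seq2 <- rleid(df$var1)
all(df$seq == df$seq2)   # TRUE when both methods produce the same grouping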
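A short demonstration of the hidden-decimals problem described above; the value 201893210853.1 is a hypothetical stand-in for whatever fractional values the real data might contain:

# R often hides decimals when printing (the default is 7 significant digits),
# so a column can look integer-valued without being one:
x <- 1999995.5
x                  # prints as 1999996

# A hypothetical pair that prints identically but differs, so diff() is
# non-zero and the counter increments without any visible change in var1:
y <- c(201893210853, 201893210853.1)
diff(y) != 0       # TRUE
y[(y %% 1) != 0]   # flags the element with a hidden fractional part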
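And a sketch of the fix described in the last comment, assuming var1 holds whole numbers that merely picked up stray fractional parts (Frank's `round()` alternative would work equally well):

library(bit64)

# as.integer64() truncates the stray decimals away, so diff() only reports
# genuine changes in var1:
df$var1 <- as.integer64(df$var1)
df$seq <- cumsum(c(1, as.numeric(diff(df$var1)) != 0))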

0 Answers