This might be trivial, but I'm seeing a peculiar case and wanted to validate it with the community.

I have a data frame with the following columns: $pid : num, $group : chr, $status : chr...

df <- tibble::tribble(
  ~pid, ~group, ~status,
  12,   "g1",   1,
  12,   "g2",   0,
  18,   "g3",   1,
  18,   "g1",   1,
  18,   "g2",   1
)

While working with window functions, I need to apply cumsum() within each group of 'pid', so I'm using the following code:

    r2 <- df %>%
      group_by(pid) %>%
      mutate(col = cumsum(status))

I'm expecting r2 to be:

  pid  group  status  col
  12   g1     1       1
  12   g2     0       1
  18   g3     1       1
  18   g1     1       2
  18   g2     1       3

But the r2 I actually get is different:

  pid  group  status  col
  12   g1     1       1
  12   g2     0       1
  18   g3     1       2
  18   g1     1       3
  18   g2     1       4

To me this looks like no 'window' is being created over the pid column: the running total carries on across pid values instead of restarting. I tried converting pid to character, but the result was the same.
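
As a quick base-R check (a sketch, not a diagnosis): the observed column is exactly what a plain, ungrouped cumsum() over the whole status vector gives, while the expected column is what cumsum() gives when applied separately per pid. The vectors below are copied from the example data; ave() is used only as a base-R stand-in for the grouped computation and is not part of the original code.

```r
# Values copied from the example data frame above
status <- c(1, 0, 1, 1, 1)
pid    <- c(12, 12, 18, 18, 18)

# Running total over the whole vector, ignoring pid:
# this reproduces the observed `col`
ungrouped <- cumsum(status)

# Running total restarted within each pid (base-R stand-in for the
# grouped mutate): this reproduces the expected `col`
grouped <- ave(status, pid, FUN = cumsum)

print(ungrouped)
print(grouped)
```

So the symptom is consistent with the grouping being ignored rather than with cumsum() itself misbehaving.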

If my understanding of cumsum is correct, what could be the reason for this behaviour?

As far as packages are concerned, I have dplyr, plyr, sqldf, data.table and lubridate loaded in my workspace.

dmi3kno
hbabbar
  • Is `status` a `character` or `numeric` vector? Should be numeric but seems like it is character from the question. – John Paul Dec 18 '17 at 19:47
  • My bad... Status is an int. There were some other columns as well in the data which were not relevant to the question – hbabbar Dec 18 '17 at 19:53
  • You can't have a space in `group_by`, but otherwise this gives me correct results. Maybe post a `dput` of your data; there could be something wonky in it – alistaire Dec 18 '17 at 20:05
  • Please make the input reproducible by showing the output from `dput(df)` in the question. – G. Grothendieck Dec 18 '17 at 20:06
  • Probably you loaded `plyr` after `dplyr` and ignored the warnings, so `plyr::mutate` is being used instead of `dplyr::mutate`. To verify, try using `dplyr::mutate` explicitly or checking `"mutate" %in% conflicts()`. – Gregor Thomas Dec 18 '17 at 20:11
  • Suggested dupe: https://stackoverflow.com/q/26106146/903061 – Gregor Thomas Dec 18 '17 at 20:11
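
The check suggested in the comments can be sketched as follows. This is a minimal illustration that rebuilds the example data and attaches only dplyr; it assumes the dplyr and tibble packages are installed. Qualifying the verbs with `dplyr::` makes the result independent of the order in which packages were attached.

```r
library(dplyr)

# Example data from the question
df <- tibble::tribble(
  ~pid, ~group, ~status,
  12,   "g1",   1,
  12,   "g2",   0,
  18,   "g3",   1,
  18,   "g1",   1,
  18,   "g2",   1
)

# TRUE would mean more than one attached package defines mutate(),
# e.g. when plyr is attached alongside dplyr
print("mutate" %in% conflicts())

# Namespace-qualified verbs always dispatch to dplyr,
# regardless of what else is attached
r2 <- df %>%
  dplyr::group_by(pid) %>%
  dplyr::mutate(col = cumsum(status))

print(r2$col)  # running total restarts for each pid: 1 1 1 2 3
```

If both packages are needed in the same session, attaching plyr before dplyr is the other common way to avoid the masking.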

1 Answer

I generally use data.table for this, with the code below. It is the same idea as the dplyr code you wrote, but it works.

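setDT(df)  # `:=` needs a data.table, so convert df in place first (data.table must be loaded)
df[, col := cumsum(status), by = pid]  # cumulative sum of status within each pid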
Sabri Karagönen
  • This would work, but I was more concerned with why the dplyr method wasn't working. I've got my package loading order correct now – hbabbar Dec 19 '17 at 04:33