
I have the following data:

Date           Col1       Col2
2014-01-01     123        12
2014-01-01     123        21
2014-01-01     124        32
2014-01-01     125        32
2014-01-02     123        34
2014-01-02     126        24
2014-01-02     127        23
2014-01-03     521        21
2014-01-03     123        13
2014-01-03     126        15

Now, I want to count, for each date, the unique values in Col1 that have not appeared on any previous date, and add that count to the previous date's running total. For example:

Date           Count
2014-01-01       3     (new values: 123, 124, 125)
2014-01-02       5     (3 + 2 new: 126, 127)
2014-01-03       6     (5 + 1 new: 521 only)
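
For reproducibility, here is one way the sample data might be constructed (a sketch: Date is kept as character and the data frame is named df to match the answers below):

df <- data.frame(
  Date = c("2014-01-01", "2014-01-01", "2014-01-01", "2014-01-01",
           "2014-01-02", "2014-01-02", "2014-01-02",
           "2014-01-03", "2014-01-03", "2014-01-03"),
  Col1 = c(123, 123, 124, 125, 123, 126, 127, 521, 123, 126),
  Col2 = c(12, 21, 32, 32, 34, 24, 23, 21, 13, 15),
  stringsAsFactors = FALSE  # plain character/numeric columns assumed
)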
BigDataScientist

2 Answers

library(dplyr)
df %.% 
  arrange(Date) %.% 
  filter(!duplicated(Col1)) %.% 
  group_by(Date) %.% 
  summarise(Count=n()) %.% # n() <=> length(Date)
  mutate(Count = cumsum(Count))
# Source: local data frame [3 x 2]
# 
#         Date Count
# 1 2014-01-01     3
# 2 2014-01-02     5
# 3 2014-01-03     6
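
Side note: `%.%` was dplyr's original chaining operator and has since been deprecated in favour of `%>%`; the same pipeline in current dplyr versions, otherwise unchanged, would be:

library(dplyr)
df %>%
  arrange(Date) %>%
  filter(!duplicated(Col1)) %>%   # keep each Col1 value only on its first date
  group_by(Date) %>%
  summarise(Count = n()) %>%
  mutate(Count = cumsum(Count))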

library(data.table)
dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
(dt <- dt[, list(Count=.N), by=Date][, Count:=cumsum(Count)])
#          Date Count
# 1: 2014-01-01     3
# 2: 2014-01-02     5
# 3: 2014-01-03     6

Or

dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
dt[, .N, by=Date][, Count:=cumsum(N)]

.N is named N (no dot) automatically for convenience in chained operations like this, so you can use both .N and N together in the next operation if need be.
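
A quick illustration of that point, reusing the keyed table from above (the column names Dates and NewValues are made up for this example): N is the per-date count carried over from the first operation, while .N in the second operation is the number of rows (i.e. dates) of that intermediate result.

dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
dt[, .N, by=Date][, .(Dates=.N, NewValues=sum(N))]
#    Dates NewValues
# 1:     3         6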

Matt Dowle
lukeA
  • Great! Thank you. I like the data.table option better. – BigDataScientist Jan 28 '14 at 21:50
  • Thanks @Arun. I'm rather new to data.table. But I'm starting to love it. :) – lukeA Jan 28 '14 at 21:54
  • It may be more efficient (in `dplyr`) to use `filter` to de-duplicate on `Col1`, i.e. `df %.% arrange(Date) %.% filter(!duplicated(Col1)) %.% group_by(Date) %.% summarise(Count=length(Date)) %.% mutate(Count = cumsum(Count))`. – mnel Jan 28 '14 at 21:57
  • I'm with @mnel - the dplyr code would look a lot better if you consistently used `%.%` – hadley Jan 28 '14 at 22:46
  • @mnel The use of `n()` instead of `length` will be probably better too. cc @hadley – dickoa Jan 28 '14 at 23:03
  • @lukeA, I have another column now named 'Type', which contains 'Type1', 'Type2' and 'Type3'. I want the cumulative sum based on 'Type' as well. – BigDataScientist Feb 11 '14 at 17:52
  • @lukeA, I figured this out, it will be `dt <- data.table(df, key=c("Date",'Type')); dt <- unique(dt, by="Col1"); (dt <- dt[, list(Count=.N), by=c('Date','Type')][, Count:=cumsum(Count), by=c('Type')])` – BigDataScientist Feb 11 '14 at 17:59

With ddply (from the plyr package) and duplicated, you just have to do the following, where `data` is the original data frame (`df` in the question):

library(plyr)
df  <- ddply(data, .(Date, Col1), nrow)                  # one row per (Date, Col1) pair
df2 <- ddply(df[!duplicated(df$Col1), ], .(Date), nrow)  # newly seen Col1 values per date
transform(df2, Count = cumsum(V1))                       # running total

i.e. you first count rows for every (Date, Col1) pair, then drop the Col1 values that already appeared on an earlier date, count the remaining new values per date, and finally take the running total.

Your data must be sorted by Date beforehand.
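
With the sample data above, df2 holds the counts of newly seen Col1 values per date (3, 2, 1), so the final step should give something like:

#         Date V1 Count
# 1 2014-01-01  3     3
# 2 2014-01-02  2     5
# 3 2014-01-03  1     6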

MatthieuBizien