Count number of duplicates in a data.table by two columns in R

Question

I'm trying to count how many times each unique string value in column z occurs within each combination of two other columns (x, y) in a data.table. I'd like to use the data.table package or something equally fast, since my real data has millions of rows.

I have data like this:

library(data.table)
dt <- data.table(x=c("aa","aa","aa","bb","cc","cc","cc","cc","cc","cc"), y=c(2,2,1,1,1,1,2,2,2,3), z=c("d","d","a","d","a","a","e","e","b","a"))

     x y z
 1: aa 2 d
 2: aa 2 d
 3: aa 1 a
 4: bb 1 d
 5: cc 1 a
 6: cc 1 a
 7: cc 2 e
 8: cc 2 e
 9: cc 2 b
10: cc 3 a

I'd like to have it like this:

dt.desired <- data.table(x=c("aa","aa", "bb","cc", "cc","cc", "cc"), y=c(1,2,1,1,2,2,3), z=c("a","d","d","a","b","e","a"), n=c(1,2,1,2,1,2,1))


    x y z n
1: aa 1 a 1
2: aa 2 d 2
3: bb 1 d 1
4: cc 1 a 2
5: cc 2 b 1
6: cc 2 e 2
7: cc 3 a 1

`dt[, .N, keyby = names(dt)]`? – David Arenburg, Jun 16 '18 at 22:00

score -1 · Answer 1 · answered Jun 16 '18 at 22:06

You can do this with `count()` from dplyr and the `%>%` pipe from magrittr, both available after loading tidyverse:

library(data.table)
library(tidyverse)

> dt %>% count(x,y,z)
# A tibble: 7 x 4
  x         y z         n
  <chr> <dbl> <chr> <int>
1 aa       1. a         1
2 aa       2. d         2
3 bb       1. d         1
4 cc       1. a         2
5 cc       2. b         1
6 cc       2. e         2
7 cc       3. a         1

If you want to keep the result as a new data frame, assign it to a variable:

z <- dt %>% count(x,y,z)
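
For comparison, here is a minimal sketch of a data.table-only version, along the lines of the one-liner suggested in the comment on the question. It assumes the example `dt` from the question; the result name `counts` and the column name `n` are just illustrative choices. I have not benchmarked it against the dplyr version on millions of rows.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  x = c("aa", "aa", "aa", "bb", "cc", "cc", "cc", "cc", "cc", "cc"),
  y = c(2, 2, 1, 1, 1, 1, 2, 2, 2, 3),
  z = c("d", "d", "a", "d", "a", "a", "e", "e", "b", "a")
)

# .N is the number of rows in each group.
# keyby groups by x, y, z and also sorts the result by those columns.
counts <- dt[, .(n = .N), keyby = .(x, y, z)]
print(counts)
```

Using `by` instead of `keyby` should keep the groups in order of first appearance rather than sorting them.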
