count duplicated rows, by unique row

Question

I think this pretty straightforward problam has a very simple solution, but I can't figure it out.

Lets say I've got a data.table with some duplicated rows (rows 3 and 4 are identical)

dt <- data.table( val1 = c(1,2,3,3,4,5,6), val2 = 8 )             
#    val1 val2
# 1:    1    8
# 2:    2    8
# 3:    3    8
# 4:    3    8
# 5:    4    8
# 6:    5    8
# 7:    6    8

I want to throw away the duplucated rows, keeping only unique rows, and introduce a new column val3 that indicates how often a row occurs in the original data

expected output:

dt.output <- data.table( val1 = c(1,2,3,4,5,6), val2 = 8, val3 = c(1,1,2,1,1,1) )             
#    val1 val2 val3
# 1:    1    8    1
# 2:    2    8    1
# 3:    3    8    2
# 4:    4    8    1
# 5:    5    8    1
# 6:    6    8    1

I've got the feeling I'm almost there using an update-join with unique(dt)[, val3 := ....], but I can't get the ... part to return what I want and it's driving me crazy.

s_baldur · Accepted Answer · 2019-09-27T11:15:56.630

3

Simply

dt[, .(val3 = .N), by = .(val1, val2)]

Or continuing with your first approach one could do something like the following:

unique(dt)[, val3 := dt[.SD, on = .(val1, val2), .N, by = .EACHI][, N]][]

Note that one can use a character vector also in by:

byvars <- grep("val\\d+", names(dt), value = TRUE)
dt[, .(val3 = .N), by = byvars]

edited Sep 27 '19 at 11:15

answered Sep 27 '19 at 11:08

s_baldur

29,441
4
36
69

.. of course.. thanks.. and what If i've got a large nuber of columns `val1, val2, ..., valn` and do not want to type all of them by hand into the `by = ` – Wimpel Sep 27 '19 at 11:15
1

@Wimpel :), you can use character vector in by (that you can produce with some system)... see update. – s_baldur Sep 27 '19 at 11:19
There might be several ways to dot this. This is an alternative: `dt %>% group_by(val1, val2) %>% summarize(var1 = val1[1], var2=val2[1])` – Zhiqiang Wang Sep 27 '19 at 11:22

count duplicated rows, by unique row

1 Answers1