0

I try to count the occurence of pairs, created by combining columns, while the order of the elements within the pairs is not important.

For example: - Data table with two columns with names (V1, V2) - Combine the columns to make pairs - Count the pairs (ignoring order within each pair): A_B = B_A

set.seed(126)
dt <- data.table(V1 = sample(LETTERS[1:4], 30, replace = T),
                 V2 = sample(LETTERS[1:4], 30, replace = T))

#Exclude rows with the same name 
dt <- dt[V1 != V2]

#Create pairs by combining V1 and V2
dt[, pair := paste(V1, V2, sep="_")]

#Count the pairs 
dt[, .N, by=.(pair)]

Result:

#     pair N
# 1:   C_A 1
# 2:   B_C 6
# 3:   C_B 3
# 4:   C_D 1
# 5:   A_B 2
# 6:   D_C 4
# 7:   A_C 2
# 8:   B_A 2
# 9:   B_D 1
# 10:  A_D 1
# 11:  D_B 2

Expected result:

  • e.g. C_A and A_C should be treated as the same pair, etc.
#     pair N
# 1:   C_A 3   #A_C
# 2:   B_C 9   #C_B
# 3:   C_D 5   #D_C
# 4:   A_B 4   #B_A
# 5:   B_D 3   #D_B
# 6 :  A_D 1

Any suggestions to perform this operation with data.table

AkselA
  • 8,153
  • 2
  • 21
  • 34
Steve
  • 47
  • 2
  • 7

1 Answers1

1

If we always write the earliest letter in the alphabet first in the pair := assignment, the code will produce the desired result. We'll use ifelse() to decide whether to write V1 before V2 as follows.

library(data.table)
set.seed(126)
dt <- data.table(V1 = sample(LETTERS[1:4], 30, replace = T),
                 V2 = sample(LETTERS[1:4], 30, replace = T))


# adjusted version where first letter always < second letter

#Exclude rows with the same name 
dt <- dt[V1 != V2]

#Create pairs by combining V1 and V2
dt[, pair := ifelse(V1 < V2,paste(V1, V2, sep="_"), paste(V2, V1, sep = "_"))]

#Count the pairs 
dt[, .N, by=.(pair)]

...and the output:

> #Count the pairs 
> dt[, .N, by=.(pair)]
   pair N
1:  A_C 3
2:  B_C 9
3:  C_D 5
4:  A_B 4
5:  B_D 3
6:  A_D 1
>
Len Greski
  • 10,505
  • 2
  • 22
  • 33