
I am using the pairwise_cor function to generate word correlations in text. The problem is that the output contains rows that are duplicates once the two item columns are treated as an unordered pair. I want to filter out those duplicated rows.

Here is an extremely simplified example:

item1 <- c("dog", "cat", "horse", "cow", "chicken", "sheep")
item2 <- c("cat", "dog", "cow", "horse", "sheep", "chicken")
correlation <- c(2,2,3,3,4,4)

df <- cbind.data.frame(item1, item2, correlation)
df


    item1   item2 correlation
1     dog     cat           2
2     cat     dog           2
3   horse     cow           3
4     cow   horse           3
5 chicken   sheep           4
6   sheep chicken           4

So essentially, rows 1 and 2 are the same, just with item1 and item2 swapped, and the same goes for rows 3 and 4 and for rows 5 and 6.

I want the output data frame, after filtering, to be:

    item1   item2 correlation
1     dog     cat           2
2   horse     cow           3
3 chicken   sheep           4

It is not just a simple matter of using unique(), at least not as far as I can figure.
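To illustrate, on the example data above unique() keeps every row, since it compares whole rows and no two rows are exactly identical:

# unique() only drops exact duplicate rows; every row here is distinct
unique(df)
#     item1   item2 correlation
# 1     dog     cat           2
# 2     cat     dog           2
# (all six rows remain)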


1 Answer


Here is one option using the sqldf package. Assuming you don't care about which correlation you retain from the duplicates, we can try:

library(sqldf)

sql <- "SELECT MIN(item1, item2) AS item1, MAX(item1, item2) AS item2,
               MAX(correlation) AS correlation
        FROM df
        GROUP BY MIN(item1, item2), MAX(item1, item2)"

output <- sqldf(sql)

Given that your duplicated rows may not always be adjacent, this sqldf option may be useful to you: SQL tables are unordered collections of records, so the grouping works no matter where the duplicates sit in the data frame.
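If you would rather stay in base R, a similar idea works: build a canonical key for each pair by putting the alphabetically smaller item first, then keep only the first occurrence of each key. This is just a sketch against the example df above, not part of the sqldf approach:

# canonical key: smaller item first, larger item second,
# so "dog"/"cat" and "cat"/"dog" produce the same key
key <- paste(pmin(as.character(df$item1), as.character(df$item2)),
             pmax(as.character(df$item1), as.character(df$item2)))

# keep only the first row for each key, preserving the original row order
df[!duplicated(key), ]
#     item1   item2 correlation
# 1     dog     cat           2
# 3   horse     cow           3
# 5 chicken   sheep           4

Unlike the SQL version, this keeps the original item order within each pair; the row names can be reset afterwards with row.names if needed.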

Tim Biegeleisen