
I'm trying to group my data by a very specific condition. Consider the data.frame below:

from <- c("a", "b", "a", "b")
to <- c("b", "a", "b", "a")
give <- c("x", "y", "y", "x")
take <- c("y", "x", "x", "y")
amount <- c(1, 2, 3, 4)
df <- data.frame(from, to, give, take, amount)

which creates something like:

  | from | to   | give | take | amount
---------------------------------------
1 | a    | b    | x    | y    | 1
2 | b    | a    | y    | x    | 2
3 | a    | b    | y    | x    | 3
4 | b    | a    | x    | y    | 4

To provide some background: consider the user in the 'from' column giving something (in column 'give') to the user in the 'to' column and taking something in return (in column 'take'). As you can see, rows 1 & 2 are the same in that sense, because they describe the same scenario, just from the other perspective. Therefore, I want them to belong to the same group. (You could also consider them duplicates, which involves the same task, i.e. identifying them as similar.) The same holds for rows 3 & 4. The amount is just some value to be summed up per group, to make the example clear.
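For example, row 2 is just row 1 with both from/to and give/take swapped at the same time, which is easy to check on the example data:

# row 2 is row 1 with from & to and give & take swapped simultaneously
paste(df$from[1], df$to[1], df$give[1], df$take[1]) ==
  paste(df$to[2], df$from[2], df$take[2], df$give[2])
#> [1] TRUE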

My desired result after grouping is as follows:

| user1 | user2 | given_by_user1 | taken_by_user1 | amount
-----------------------------------------------------------
| a     | b     | x              | y              | 3       # contains former rows 1&2
| a     | b     | y              | x              | 7       # contains former rows 3&4

Note that both from & to and give & take need to be inverted together, i.e. taking the values of a pair of columns, sorting them, and treating rows as equal on that basis is not what I need. That would lead to all four rows in the above example being considered equal. That kind of solution was proposed in similar posts, e.g.:

Remove duplicates where values are swapped across 2 columns in R
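To show why that approach fails here: sorting each pair of columns independently gives all four rows of my example the same key, so they would all end up in one group:

# sorting from/to and give/take independently collapses everything
paste(pmin(from, to), pmax(from, to), pmin(give, take), pmax(give, take))
#> [1] "a b x y" "a b x y" "a b x y" "a b x y"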

I've read many similar solutions and found one which actually does the trick:

match two columns with two other columns

However, the proposed solution creates an outer product of two columns, which is not feasible in my case, because my data has millions of rows and at least thousands of unique values within each column.

(Any solution that either groups the rows directly, or gets the indices of rows belonging to the same group would be great!)
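To make the requirement concrete, here is a brute-force sketch of the group indices I'm after. It compares every pair of rows, so it is quadratic and only meant to pin down the matching rule, not to be run on the real data:

# quadratic reference: two rows belong together when both
# from/to and give/take are swapped at the same time
group <- seq_len(nrow(df))
for (i in seq_len(nrow(df))) {
  for (j in seq_len(i - 1)) {
    if (df$from[i] == df$to[j]   && df$to[i]   == df$from[j] &&
        df$give[i] == df$take[j] && df$take[i] == df$give[j]) {
      group[i] <- group[j]
    }
  }
}
group
#> [1] 1 1 3 3

aggregate(amount ~ group, data = cbind(df, group), FUN = sum)
#>   group amount
#> 1     1      3
#> 2     3      7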

Many thanks for any suggestions!
