1

I have a large dataset I want to simplify but I'm currently having some troubles with one thing. The following table shows a origin destination combination. The count column, represents the amount of occurrences of A to B for example.

From    To  count
A       B     2
A       C     1
C       A     3
B       C     1

The problem I have is that for example A to C (1), is actually the same as C to A (3). As direction doesn't really matter to me only that there's a connection between A and C, I wonder how can I simply have A to C (4).

The problem is that I have a factor with 400 levels, so I can't do it manually. Is there something with dplyr or similar that can solve this for me?

h3rm4n
  • 4,126
  • 15
  • 21
FilipeTeixeira
  • 1,100
  • 2
  • 9
  • 29

2 Answers2

2
df[1:2] <- t(apply(df[1:2], 1, sort))

aggregate(count ~ From + To, df, sum)

results in:

  From To count
1    A  B     2
2    A  C     4
3    B  C     1
h3rm4n
  • 4,126
  • 15
  • 21
  • Thank you. It works perfect and it's actually quite simple. May I ask why the "t(apply(". I understand that it sorts the columns but I'm not really sure why not going directly for the aggregate. – FilipeTeixeira Mar 03 '17 at 22:09
  • @FilipeTeixeira you need to operate (the sorting) on the rows, that is why you need `apply(df[1:2], 1, sort)`; in the resulting matrix the original columns are now rows, so you need to transpose them back with the `t`-function – h3rm4n Mar 04 '17 at 09:34
  • I get it now. Thank you :). – FilipeTeixeira Mar 04 '17 at 15:44
0

Here is a base R method using aggregate, sort, paste, and mapply.

with(df, aggregate(count,
                   list(route=mapply(function(x, y) paste(sort(c(x, y)), collapse=" - "),
                                                          From, To)), sum))
  route x
1 A - B 2
2 A - C 4
3 B - C 1

Here, mapply takes pairs of elements from the from and to variables, sorts them and pastes them into a single string with collapse=TRUE. The resulting string vector is used in aggregate to group the observations and sum the count values. with reduces typing.

lmo
  • 37,904
  • 9
  • 56
  • 69
  • Thank you for the explanation. Is there any practical difference from the solutions above? It's still interesting to see a different approach, so it's not wasted at all. – FilipeTeixeira Mar 04 '17 at 15:46