Matching non-sequential combinations of strings in R and collapsing data frame based on combos

Question

I have data like this:

table <- data.frame(col1 = c("USA", "CHN", "DEU", "DEU"), col2 = c("DEU", "DEU", "USA", "CHN"), col3 = c(100, 150, 300, 250))

  col1 col2 col3
1  USA  DEU  100
2  CHN  DEU  150
3  DEU  USA  300
4  DEU  CHN  250

How do I collapse this table based on the combination of values in col1 and col2 independent of the sequence of this combination -- so that I have the following table?

  col1 col2 col3
1  USA  DEU  400
2  CHN  DEU  400

Many thanks for any advice you could provide

UPDATE: See updated data here:

AUS AUS 431.92164
AUS AUS 581.14942
AUS AUS 1822.10345
AUS AUS 1024.4436
AUS AUS 126.45221
AUS AUS 171.01716
AUS AUS 432.23415
AUS AUS 35.28675
AUS AUS 1488.02559
AUS AUS 0
AUS AUS 9057.73836
AUS AUS 9933.79762
AUS AUS 4307.49845
AUS AUS 73.33633
AUT AUT 1089.27094
AUT AUT 413.83978
AUT AUT 118.46822
AUT AUT 3700.34366

Essentially the same as identifying duplicates here - https://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe/25298863 - `paste(do.call(pmax, table[1:2]), do.call(pmin, table[1:2]))` for instance creates the order-independent grouping variable. — thelatemail, Apr 19 '18 at 03:52

Maurits Evers · Accepted Answer · 2018-04-19T03:55:34.910

1

Here is a base R solution using aggregate:

table[, 1:2] <- t(apply(table[, 1:2], 1, sort))    
aggregate(col3 ~ col1 + col2, table, sum)
#  col1 col2 col3
#1  CHN  DEU  400
#2  DEU  USA  400

Explanation: The first line re-orders entries in the first two columns using sort, before using aggregate to sum col3 entries based on entries in col1 and col2.

Or using a tidyverse approach:

library(tidyverse);
table %>%
    mutate_if(is.factor, as.character) %>%
    rowwise() %>%
    mutate(tmp = paste(sort(c(col1, col2)), collapse = "_")) %>%
    ungroup() %>%
    group_by(tmp) %>%
    summarise(col3 = sum(col3)) %>%
    separate(tmp, c("col1", "col2"))
## A tibble: 2 x 3
#  col1  col2   col3
#  <chr> <chr> <dbl>
#1 CHN   DEU    400.
#2 DEU   USA    400.

Update

With your updated data

table <- read.table(text =
"AUS AUS 431.92164
AUS AUS 581.14942
AUS AUS 1822.10345
AUS AUS 1024.4436
AUS AUS 126.45221
AUS AUS 171.01716
AUS AUS 432.23415
AUS AUS 35.28675
AUS AUS 1488.02559
AUS AUS 0
AUS AUS 9057.73836
AUS AUS 9933.79762
AUS AUS 4307.49845
AUS AUS 73.33633
AUT AUT 1089.27094
AUT AUT 413.83978
AUT AUT 118.46822
AUT AUT 3700.34366", header = F);
colnames(table) <- c("col1", "col2", "col3");

table[, 1:2] <- t(apply(table[, 1:2], 1, sort))
aggregate(col3 ~ col1 + col2, table, sum)
#    col1 col2      col3
#  1  AUS  AUS 29485.005
#  2  AUT  AUT  5321.923

edited Apr 19 '18 at 03:55

answered Apr 19 '18 at 03:01

Maurits Evers

49,617
4
47
68

I'm getting the following warning message for first command in your base R approach: Warning messages: 1: In `[<-.data.table`(`*tmp*`, , 1:2, value = c("AUS", "AUS", "AUS", : 2 column matrix RHS of := will be treated as one vector 2: In `[<-.data.table`(`*tmp*`, , 1:2, value = c("AUS", "AUS", "AUS", : Supplied 9940 items to be assigned to 4970 items of column 'CountryFrom' (4970 unused) 3: In `[<-.data.table`(`*tmp*`, , 1:2, value = c("AUS", "AUS", "AUS", : 2 column matrix RHS of := will be treated as one vector ... – nymuffin Apr 19 '18 at 03:14
@nymuffin There is no `"AUS"` in the sample data you provided. First please confirm that the solution works with the sample data you provided. If your data is not representative please update your sample data. It's hard to tell what's going on without knowing your data. Have you got missing entries? – Maurits Evers Apr 19 '18 at 03:16
Hi there, the base R solution worked for the initial sample data I provided -- but but unfortunately didn't work with the actual data set I was working with. I've edited my initial post to include a new sample of data (the base R solution didn't work for this new sample) – nymuffin Apr 19 '18 at 03:52
@nymuffin Works just fine with your updated data (see my updated answer). Try it out yourself by copy & pasting the code I posted. – Maurits Evers Apr 19 '18 at 03:56

Matching non-sequential combinations of strings in R and collapsing data frame based on combos

1 Answers1

Update