3

I have a dataframe containing variables on trade flows between pairs of countries, one country being the exporter and one being the importer for each row.

I want to create an ID number variable which identifies each unordered country pair, giving the same ID number to each pair regardless of which is the exporter and which is the importer. So Australia-United States would have the same ID as United States-Australia but a different ID to Australia-Great Britain.

This is an example of what the data with the ID variable would look like.

YEAR     ISO_EXP     ISO_IMP     UNORD_PAIR_ID
1970     AUS         GBR         1
1970     AUS         USA         2
1970     AUS         ZIM         3
1970     GBR         AUS         1
1970     GBR         USA         4
1970     GBR         ZIM         5
1970     USA         AUS         2
1970     USA         GBR         4
1970     USA         ZIM         6
1970     ZIM         AUS         3
1970     ZIM         GBR         5
1970     ZIM         USA         6

My dataset has around 2 million rows, comprising around 44,000 country pairs over 47 years.

I have used the following code to create an ID for each ordered country pair.

data$ORD_PAIR_ID <- data %>% group_indices(data$ISO_EXP, data$ISO_IMP)

But I have not been able to work out how to create an ID for unordered pairs.

Any help greatly appreciated.

markdavis
  • 31
  • 3
  • 1
    Take a look here - https://stackoverflow.com/questions/25145982/data-table-with-two-string-columns-of-set-elements-extract-unique-rows-with-eac/25151395 or here - https://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe and check out the questions linked from those in-turn. Or even this one early today which covered a similar problem in dplyr - https://stackoverflow.com/questions/55733670/combining-data-frame-based-on-data-in-each-column-in-dplyr/55734428 – thelatemail Apr 18 '19 at 00:57
  • Btw, doing `data %>% group_indices(data$ISO_EXP, data$ISO_IMP)` is going to break your analyses pretty quickly. You want to do `data %>% group_indices(ISO_EXP, ISO_IMP)` - no need to subset via `$` as you are working within `data` via the `%>%` – thelatemail Apr 18 '19 at 01:00

1 Answers1

1

We can sort the elements in the same of ISO_EXP, ISO_IMP with pmin/pmax and use that in group_indices

library(dplyr)
df1 %>%  
    mutate(pairid = group_indices(., pmax(ISO_EXP, ISO_IMP), 
                                     pmin(ISO_EXP, ISO_IMP)))
#   YEAR ISO_EXP ISO_IMP UNORD_PAIR_ID pairid
#1  1970     AUS     GBR             1      1
#2  1970     AUS     USA             2      2
#3  1970     AUS     ZIM             3      4
#4  1970     GBR     AUS             1      1
#5  1970     GBR     USA             4      3
#6  1970     GBR     ZIM             5      5
#7  1970     USA     AUS             2      2
#8  1970     USA     GBR             4      3
#9  1970     USA     ZIM             6      6
#10 1970     ZIM     AUS             3      4
#11 1970     ZIM     GBR             5      5
#12 1970     ZIM     USA             6      6

Or using base R

v1 <- do.call(paste, as.data.frame(t(apply(df1[2:3], 1, sort))))
df1$pairid <-  match(v1, unique(v1))

data

df1 <- structure(list(YEAR = c(1970L, 1970L, 1970L, 1970L, 1970L, 1970L, 
 1970L, 1970L, 1970L, 1970L, 1970L, 1970L), ISO_EXP = c("AUS", 
  "AUS", "AUS", "GBR", "GBR", "GBR", "USA", "USA", "USA", "ZIM", 
  "ZIM", "ZIM"), ISO_IMP = c("GBR", "USA", "ZIM", "AUS", "USA", 
  "ZIM", "AUS", "GBR", "ZIM", "AUS", "GBR", "USA"), UNORD_PAIR_ID = c(1L, 
  2L, 3L, 1L, 4L, 5L, 2L, 4L, 6L, 3L, 5L, 6L)), class = "data.frame", 
  row.names = c(NA, -12L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thanks v. much akrun. This has achieved what I wanted. – markdavis Apr 19 '19 at 00:48
  • I found the base R solution worked. I upvoted your answer, but as a new use of stackoverflow I do not have enough reputation points for my vote to display.Thanks again. – markdavis Apr 20 '19 at 04:17