3

I've come across something a bit above my skill set. I'm working with IMF trade data that consists of data between country dyads. The IMF dataset consists of ' unordered duplicate' records in that each country individually reports trade data. However, due to a variety of timing, recording systems, regime type, etc., there are discrepancies between corresponding values. I'm trying to manipulate this data in two ways:

  1. Assign the mean values to the duplicated dyads.
  2. Assign the dyad values conditionally based on a separate economic indicator or development index (who do I trust more?).

There are several discussions of identifying unordered duplicates here, here, here, and here but after a couple days of searching I have yet to see what I'm trying to do.

Here is an example of the raw data. In reality there are many more variables and several hundred thousand dyads:

reporter<-c('USA','GER','AFG','FRA','CHN')
partner<-c('AFG','CHN','USA','CAN','GER')
year<-c(2010,2010,2010,2009,2010)
import<-c(-1000,-2000,-2400,-1200,-2000)
export<-c(2500,2200,1200,2900,2100)
rep_econ1<-c(28,32,12,25,19)

imf<-data.table(reporter,partner,year,import,export,rep_econ1)

imf

   reporter partner year import export rep_econ1
1:      USA     AFG 2010  -1000   2500        28
2:      GER     CHN 2010  -2000   2200        32
3:      AFG     USA 2010  -2400   1200        12
4:      FRA     CAN 2009  -1200   2900        25
5:      CHN     GER 2010  -2000   2100        19

The additional wrinkle is that import and export are inverses of each other between the dyads, so they need to be matched and meaned with an absolute value.

For objective 1, the resulting data.table is:

Mean

 reporter   partner year    import  export  rep_econ1
  USA        AFG    2010    -1100   2450    28
  GER        CHN    2010    -2050   2100    32
  AFG        USA    2010    -2450   1100    12
  FRA        CAN    2009    -1200   2900    25
  CHN        GER    2010    -2100   2050    19

For objective 2:

Conditionally Assign on Higher Economic Indicator (rep_econ1)

 reporter   partner year    import  export  rep_econ1
 USA         AFG    2010    -1000   2500    28
 GER         CHN    2010    -2000   2200    32
 AFG         USA    2010    -2500   1000    12
 FRA         CAN    2009    -1200   2900    25
 CHN         GER    2010    -2200   2000    19

It's possible not all dyads are represented twice so I included a solo record. I prefer data.table but I'll go with anything that leads me down the right path.

Thank you for your time.

Josh Brinks
  • 141
  • 6

1 Answers1

3

Pre - Processing:

library(data.table)

# get G = reporter/partner group and N = number of rows for each group
# Thanks @eddi for simplifying
imf[, G := .GRP, by = .(year, pmin(reporter, partner), pmax(reporter, partner))]
imf[, N := .N, G]

Option 1 (means)

# for groups with 2 rows, average imports and exports
imf[N == 2
    ,   `:=`(import = (import - rev(export))/2
           , export = (export - rev(import))/2)
    , by = G]

imf


#    reporter partner year import export rep_econ1 G N
# 1:      USA     AFG 2010  -1100   2450        28 1 2
# 2:      GER     CHN 2010  -2050   2100        32 2 2
# 3:      AFG     USA 2010  -2450   1100        12 1 2
# 4:      FRA     CAN 2009  -1200   2900        25 3 1
# 5:      CHN     GER 2010  -2100   2050        19 2 2

Option 2 (highest economic indicator)

# for groups with 2 rows, choose imports and exports based on highest rep_econ1
imf[N == 2
    , c('import', 'export') := {
        o <- order(-rep_econ1)
        import <- cbind(import, -export)[o[1], o]
        .(import, export = -rev(import))}
    , by = G]

imf


#    reporter partner year import export rep_econ1 G N
# 1:      USA     AFG 2010  -1000   2500        28 1 2
# 2:      GER     CHN 2010  -2000   2200        32 2 2
# 3:      AFG     USA 2010  -2500   1000        12 1 2
# 4:      FRA     CAN 2009  -1200   2900        25 3 1
# 5:      CHN     GER 2010  -2200   2000        19 2 2

Option 2 explanation: You need to select the row with the highest economic indicator (i.e. row order(-rep_econ1)[1]) and use that for imports, but if the second row is the "trusted" one, it needs to be reversed. Otherwise you'd have the countries switched, since the second reporter's imports (now the first element of cbind(import, -export)[o[1],]) would be assigned as the first reporter's imports (because it's the first element).

Edit:

If imports and exports are both positive in the input data and need to be positive in the output data, the two calculations above can be modified as

imf[N == 2
    ,   `:=`(import = (import + rev(export))/2
           , export = (export + rev(import))/2)
    , by = G]

And

imf[N == 2
    , c('import', 'export') := {
        o <- order(-rep_econ1)
        import <- cbind(import, export)[o[1], o]
        .(import, export = rev(import))}
    , by = G]
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
  • 3
    a standard trick to get the grouping is `by = .(pmin(reporter, partner), pmax(reporter, partner))`; + you can check .N on the fly without the need to precompute it – eddi Dec 07 '18 at 21:24
  • that's not what I meant about .N, I meant using an `if/else` (arguably more cumbersome) - you still need to precompute it if you want to select those rows only to begin with – eddi Dec 07 '18 at 21:38
  • Oh I see. I thought maybe I had missed a `data.table` update and `.N` could be used in `i` now – IceCreamToucan Dec 07 '18 at 21:44
  • Thank you for you effort. I will try this out tomorrow. I appreciate the detailed response how it's actually carrying out the operation as opposed to simply code. – Josh Brinks Dec 08 '18 at 02:40
  • Country pairs need to account for the reporting year. The `pmin(reporter, partner)` lumps all dyad records into one group. It doesn't look like you can just put year in the `pmin` statement? – Josh Brinks Dec 10 '18 at 15:53
  • 1
    You could put it in the `by` argument. Just add it to the front as `by = .(year, pmin(reporter, partner), pmax(reporter, partner))]`. I've edited the answer to add that. – IceCreamToucan Dec 10 '18 at 15:57
  • 1
    Looks like `imf[, G := .GRP, by = .(pmin(reporter, partner), pmax(reporter, partner),year)]` works. – Josh Brinks Dec 10 '18 at 16:00
  • Sure, the order of variables in `by` doesn't matter – IceCreamToucan Dec 10 '18 at 16:00
  • One last question (already marked as correct answer), I was mistaken and imports aren't recorded as negative. How would the code be adjusted to handle this; assuming it's a simple fix? I just flipped the signs before and after using your code so it's fine, but I was curious. – Josh Brinks Dec 10 '18 at 16:46
  • You may have already figured it out by now, but I added an edit for how the calculations change if imports and exports are both positive. – IceCreamToucan Dec 11 '18 at 13:30