0

I have a data frame that looks like this (simplified from 699 treaties):

TRT <- data.frame(T.ID=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8),
              Treaty=c("hungary slovenia 1994", "hungary slovenia 1994",
                                "nicaragua taiwan 2006", "nicaragua taiwan 2006",
                                "ukraine uzbekistan 1994", "ukraine uzbekistan 1994",
                                "brazil uruguay 1986", "brazil uruguay 1986",
                                "albania macedonia 2002", "albania macedonia 2002",
                                "albania moldova 2003", "albania moldova 2003",
                                "albania romania 2003", "albania romania 2003",
                                "Treaty of Izmir 1977","Treaty of Izmir 1977",
                                "Treaty of Izmir 1977"),
              sc.y=c("HUN1994", "SLV1994", "NIC2006", "TAW2006", "UKR1994", 
                     "UZB1994", "BRA1986", "URU1986", "ALB2002", "MAC2002", 
                     "ALB2003", "MLD2003", "ALB2003", "RUM2003", "IRN1977", 
                     "TUR1977", "PAK1977"),
              prom.demo=c(1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0),
              polity=c(10,10,8,10,7,-9,7,9,7,9,7,8,7,8,-10,-7,9))

In the end, I want to have a data frame that lists each treaty only once, its value of the “prom.demo”-column and one column that contains the difference of the maximum and minimum among the “polity”-values of the contracting parties of each treaty (most treaties have only two contracting parties, but some have up to 51). Is there any R command that spares me 699 calculations?

  • I don't get that last complication. Is `scode1` sometimes a comma-separated string of country codes? Where's `scode2` gone in that case? Or is it a different data frame entirely? – Spacedman Nov 04 '14 at 15:40
  • Also, I don't quite understand "For the time being, the sign of this “difference”-value does not have to be the same among values at all costs" – Spacedman Nov 04 '14 at 15:42
  • Also also, your sample data is next to useless since none of the treaties have matching scode/year records in the polity data. Please try and make life easy for answerers by providing workable data sets. Its taken me ten minutes to turn your plain text into data frames, for example. – Spacedman Nov 04 '14 at 16:12
  • 2
    Please read http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Spacedman Nov 04 '14 at 16:12
  • 1. It's a different data frame with indeed comma-separated strings, otherwise I would have to have created up to 51 scode-columns. – Mackendrick Nov 04 '14 at 17:46
  • 1
    No, the tidy way to do it would be a to have a single data frame of (treaty-id, country code) pairs with repeated treaty-id numbers. Then you can easily get all countries in a particular treaty even if there's more than two. It also makes matching easier since you only have to match scode with scode - there's no more scode1 and scode2. – Spacedman Nov 04 '14 at 17:48
  • 2. My country dyads are ordered alphabetically. The sign will depend on which country will be taken as the minuend, and which as the subtrahend. I notice that this sentence is just adding confusion - I'll remove it. – Mackendrick Nov 04 '14 at 18:01
  • I see. Thank you. 3. I'm sorry, I did not want you to go through this trouble, but thanks a lot. – Mackendrick Nov 04 '14 at 18:07

1 Answers1

1

Using dplyr its a join on scode and year followed by grouping by Treaty and then working out the difference between the min and max polity:

require(dplyr)
left_join(treaties, Polity, c("scode","year")) %>% group_by(Treaty) %>% summarise(PolityDiff=max(polity,na.rm=TRUE)-min(polity,na.rm=TRUE))
Source: local data frame [8 x 2]

                   Treaty PolityDiff
1  albania macedonia 2002          2
2    albania moldova 2003          1
3    albania romania 2003          1
4     brazil uruguay 1986          2
5   hungary slovenia 1994          0
6   nicaragua taiwan 2006          2
7    Treaty of Izmir 1977         NA
8 ukraine uzbekistan 1994         16

The NA's are where you don't have any matching scode/year (The Treaty of Izmir is IRN/TUR/PAK in 1977, and none of those are in the Polity data).

Note that if you want NA if any one of the participating countries are not in the Polity data, use:

left_join(treaties, Polity, c("scode","year")) %>% group_by(Treaty) %>% summarise(PolityDiff=max(polity)-min(polity))

which gives:

                   Treaty PolityDiff
1  albania macedonia 2002          2
2    albania moldova 2003          1
3    albania romania 2003          1
4     brazil uruguay 1986          2
5   hungary slovenia 1994         NA
6   nicaragua taiwan 2006          2
7    Treaty of Izmir 1977         NA
8 ukraine uzbekistan 1994         16

because Slovenia is coded as SLV in Polity but there's SLO in the treaties - mistake? Anyway, there's no SLO/1994 in Polity so that treaty returns as NA in this variant. It returns zero in my first example because the NA gets dropped and the polity difference is the difference between one number and itself...

Spacedman
  • 92,590
  • 12
  • 140
  • 224