Merging cases into one in R

Question

I have a very newbie question. I'm using the Aid Worker Security Database, which records episodes of violence against aid workers, with incident reports from 1997 through the present. The events are marked independently in the dataset. I would like to merge all events that happened in a single country in a given year, sum the values of the other variables and create a simple time series with the same number of years for all countries (1997-2013). Any idea how to do it?

df
#   year  country totalnationals internationalskilled
# 1 1997   Rwanda              0                    3
# 2 1997 Cambodia              1                    0
# 3 1997  Somalia              0                    1
# 4 1997   Rwanda              1                    0
# 5 1997 DR Congo             10                    0
# 6 1997  Somalia              1                    0
# 7 1997   Rwanda              1                    0
# 8 1998   Angola              5                    0

Where "df" is defined as:

df <- structure(list(year = c(1997L, 1997L, 1997L, 1997L, 1997L, 1997L, 
  1997L, 1998L), country = c("Rwanda", "Cambodia", "Somalia", "Rwanda", 
  "DR Congo", "Somalia", "Rwanda", "Angola"), totalnationals = c(0L, 
  1L, 0L, 1L, 10L, 1L, 1L, 5L), internationalskilled = c(3L, 0L, 
  1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("year", "country", "totalnationals", 
  "internationalskilled"), class = "data.frame", row.names = c(NA, -8L))

I would like to have something like that:

#    year  country totalnationals internationalskilled
# 1  1997   Rwanda              2                    3
# 2  1997 Cambodia              1                    0
# 3  1997  Somalia              1                    1
# 4  1997 DR Congo             10                    0
# 5  1997   Angola              0                    0
# 6  1998   Rwanda              0                    0
# 7  1998 Cambodia              0                    0
# 8  1998  Somalia              0                    0
# 9  1998 DR Congo              0                    0
# 10 1998   Angola              5                    0

Sorry for the very, very newbie question... but so far I couldn't figure out how to do it. Thanks! :-)

Please read [this](http://stackoverflow.com/q/5963269/324364) and then edit your question accordingly. — joran, Nov 30 '13 at 17:57

TheComeOnMan · Accepted Answer · 2013-11-30T21:41:03.413

1

Updated after OP's comments -

df <- subset(df, year <= 2013 & year >= 1997)
df$totalnationals <- as.integer(df$totalnationals)
df$internationalskilled <- as.integer(df$internationalskilled)
df2 <- aggregate(data = df,cbind(totalnationals,internationalskilled)~year+country, sum)

To add 0s for years without a record -

df3 <- expand.grid(unique(df$year),unique(df$country))
df3 <- merge(df3,df2, all.x = TRUE, by = 1:2)
df3[is.na(df3)] <- 0

edited Nov 30 '13 at 21:41

answered Nov 30 '13 at 18:17

TheComeOnMan

12,535
8
39
54

Thank you for the answer, but it didn't work out as I expected. It either shows 'Error in eval(expr, envir, enclos) : object 'year' not found' or it gives me an empty dataset if I include 'df <- subset(df, df$year <= 2013 & df$year >= 1997)'. Thanks anyway :) – danilofreire Nov 30 '13 at 19:11
1

Updated. Also, I might be missing an aspect, which is if you need a 0 for years in which nobody was killed. Is that required? – TheComeOnMan Nov 30 '13 at 19:24
1

Added the inserting zero part too. – TheComeOnMan Nov 30 '13 at 19:35
The code for the zero part works, but I ended up with repeated values for the same country and year. After the merge I ended up with around 1.5 million cases and I should have about 1190 cases (17 years (1997-2013) x 70 countries). Is there any way to remove the repeated cases? Thanks! – danilofreire Nov 30 '13 at 20:58
1

Sorry, my bad. The `expand.grid` needed a `unique` inside of it. Can you try it now? – TheComeOnMan Nov 30 '13 at 21:41
It works wonders, thanks for your help. You've solved my problem! :) – danilofreire Nov 30 '13 at 22:08

score 1 · Answer 2 · answered Nov 30 '13 at 21:41

Same thing with data tables (can be faster on large datasets).

library(data.table)
dt   <- data.table(df,key="year,country")
smry <- dt[,list(totalnationals      =sum(totalnationals), 
                 internationalskilled=sum(internationalskilled)),
           by="year,country"]
countries   <- unique(dt$country)
template    <- data.table(year=rep(1997:2013,each=length(countries)),
                          country=countries, 
                          key="year,country")
time.series <- smry[template]
time.series[is.na(time.series)]=0

Merging cases into one in R

2 Answers2