efficient way to match and sum variables of two data frames based on two criteria

Question

I have a data frame df1 on import data for 397 different industries over 17 years and several different exporting countries/ regions.

> head(df1)
     year importer exporter      imports sic87dd
2300 1991      USA      CAN 9.404848e+05    2011
2301 1991      USA      CAN 2.259720e+04    2015
2302 1991      USA      CAN 5.459608e+02    2021
2303 1991      USA      CAN 1.173237e+04    2022
2304 1991      USA      CAN 2.483033e+04    2023
2305 1991      USA      CAN 5.353975e+00    2024

However, I want the sum of all imports for a given industry and a given year, regardless of where they came from. (The importer is always the US, sic87dd is a code that uniquely identifies the 397 industries)

So far I have tried the following code, which works correctly but is terribly inefficient and takes ages to run.

sic87dd <- unique(df1$sic87dd)
year <- unique (df1$year)
df2 <- data.frame("sic87dd" = rep(sic87dd, each = 17), "year" = rep(year, 397), imports = rep(0, 6749))
i <- 1
j <- 1

while(i <= nrow(df2)){
  while(j <= nrow(df1)){
    if((df1$sic87dd[j] == df2$sic87dd[i]) == TRUE & (df1$year[j] == df2$year[i]) == TRUE){
      df2$imports[i] <- df2$imports[i] + df1$imports[j]
    }
      j <- j + 1
  }
  i <- i + 1
  j <- 1
}

Is there a more efficient way to do this? I have seen some questions here that were somewhat similar and suggested the use of the data.table package, but I can't figure out how to make it work in my case.

Any help is appreciated.

Please give a reproducible example along with expected output. Thanks — Sotos, Jan 17 '19 at 14:44

score 0 · Accepted Answer · answered Jan 17 '19 at 14:53

There is a simple solution using dplyr:

First, you'll need to set your industry field as a factor (I'm assuming this entire field consists of a 4 digit number):

df1$sic87dd <- as.factor(df1$sic87dd)

Next, use the group_by command and summarise:

df1 %>% 
  group_by(sic87dd) %>% 
  summarise(total_imports = sum(imports))

efficient way to match and sum variables of two data frames based on two criteria

1 Answers1