0

I am trying to group a data frame by two columns (agency name and year) to create a new data frame containing the sum of my numeric variables and the string variables. However, when I use the group command in dplyr all of my string variables are dropped.

I have a data frame containing data on transportation grants. The variables consist of the grant funding, a dummy indicating if the grant was awarded, the agency that applied, the year, and the state that the agency is located in. Each agency is located in a unique state. I would like to group the data by the agency and the year (because some agencies submitted multiple applications in the same year) and create a new data frame containing the agency, the sum of grant funds, the sum of awarded grants, the year, and the state.

I have tried the following code:

library(tidyverse)

Applicant_Name <- Applications %>%
  group_by(ApplicantName, Year) %>%
  select(everything()) %>%
 summarise_if(is.numeric, sum, na.rm=TRUE)

However, when I run this code the state variable is dropped from the data frame. I have also considered creating another data frame with the string variable via the aggregate command. Then merging this data frame with the Applicant_Name data frame. I used the following code:

test2 <- aggregate(test$State, by = list(test$ApplicantName, test$Year), paste(), 
                   collapse = " ") 

However, this code creates a State column where the state names are listed in each cell as many times as the applicant name appears in the test data frame.

benalbert342
  • 71
  • 1
  • 4
  • 1
    Since we don't have your data, we can't run your code, and we can't see any output, so the best anybody can do is guess. [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that's easy for folks to help with. – camille Aug 28 '19 at 21:06

1 Answers1

0

I have interpreted your data description to my best effort, and come up with the following:

dt = data.table(
  grant = sample(seq(10000, 500000, by = 10000), 200, T),
  agency = sample(LETTERS[1:3], 200, T),
  awarded = sample(c(T,F), 200, T),
  year = sample(1990:2010, 200, T),
  state = sample(letters[1:10], 200, T)
)

> head(dt)
    grant agency awarded year state
1: 100000      A   FALSE 2007     j
2: 190000      C    TRUE 2006     b
3: 390000      C   FALSE 1992     h
4: 450000      B   FALSE 2002     d
5:  70000      B   FALSE 1996     d
6: 130000      B   FALSE 2000     g

The following code produces the sum of (i) applied grants and (ii) awarded grants for each agency-state-year group.

mat = dt[, .(SumApplied = sum(grant), SumAward = sum(awarded*grant)), keyby = .(agency, year, state)]


> mat
     agency year state SumApplied SumAward
  1:      A 1990     f    1160000   660000
  2:      A 1990     g     360000   360000
  3:      A 1990     j     330000        0
  4:      A 1991     a     800000   800000
  5:      A 1991     d    1120000        0
 ---                                      
169:      C 2009     c     140000   140000
170:      C 2010     a     490000        0
171:      C 2010     b     330000        0
172:      C 2010     f      40000    40000
173:      C 2010     h     360000        0

I hope it helps!

JDG
  • 1,342
  • 8
  • 18