Summarize by year and create a new variable containing a vector / list of unique values for each row

Question

I have a dataset with all natural disaster that occured over a certain time period. I would like to summarize them by year and state. When summarizing I would like to create a variable (= d_disasters) that shows me the unique types of natural disasters, e.g. for Texas, I would expect to only show Hurricane.

I am currently using dplyr:group_by and dplyr::summarize to summarize my data by year and by state & dplyr::mutate and dplyr:map_int to create new variables with the total number of natural disasters per year ($n_disasters using length) and the unique number of natural disasters ($n_distinct using n_distinct()).

Starting dataset:

structure(list(year = c(1998, 1998, 1998, 1998, 1998), country = c("US", 
"US", "US", "US", "US"), state = c("Texas", "Texas", "California", 
"New York", "New York"), deaths = c(12, 5, 9, 10, 18), injured = c(3, 
1, 3, 5, 9), disastertype = c("Hurricane", "Hurricane", "Wild fire", 
"Flood", "Epidemic")), class = "data.frame", row.names = c(NA, 
-5L))

Result dataset:

structure(list(year = c(1998, 1998, 1998), state = c("California", 
"New York", "Texas"), u_disastertype = c("Wild fire", "Flood, Epidemic", 
"Hurricane"), disastertype = c("Wild fire", "Flood, Epidemic", 
"Hurricane, Hurricane"), deaths = c(9, 28, 17), injured = c(3, 
14, 4), n_distinct = c(1L, 2L, 1L), n_disasters = c(1L, 2L, 2L
)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-3L), groups = structure(list(year = 1998, .rows = structure(list(
    1:3), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-1L), .drop = TRUE))

EDIT: Edited for clarification.

Andre Wildberg · Accepted Answer · 2021-11-18T13:18:58.320

1

Try aggregate. This takes the output of 2 3 aggregates and puts them together.

list2 <- function(x){ c(unique(x),length(table(x))) }

lt <- list(year=dat$year, county=dat$country, state=dat$state )

data.frame( aggregate( dat[,c(4,5)], lt, sum ), 
  setNames( aggregate( dat$disastertype, lt, list2 )[,4, drop=F], colnames(dat)[6] ), 
  setNames( aggregate( dat$disastertype, lt, length )[,4, drop=F], "n_disasters") )

  year county      state deaths injured       disastertype n_disasters
1 1998     US California      9       3       Wild fire, 1           1
2 1998     US   New York     28      14 Flood, Epidemic, 2           2
3 1998     US      Texas     17       4       Hurricane, 1           2

Not sure if you want to keep the n_... columns or not though...

EDIT: added "n_disasters"

EDIT2: added suggestion to include "distinct disasters"

edited Nov 18 '21 at 13:18

answered Nov 15 '21 at 17:30

Andre Wildberg

12,344
3
12
29

yes, I would like to keep the `n_columns`, see my answer! I stumbled across it myself yesterday and used it to keep using the `dplyr` package for the moment. I would love to know if there is a solution using `aggregate`. – flxflks Nov 16 '21 at 14:04
You also need distinct? Seems redundant since all are 1. Where does distinct come from? – Andre Wildberg Nov 16 '21 at 15:28
There was a mistake in my coding making the n_distinct 1 for all columns. Actually, n_distinct should be 2 for New York as we have two distinct types of disasters happening in that year (Flood and Epidemic) whereas in Texas we have two disasters but only 1 distinct type (Hurricane). Apologies for the coding mistake. I have corrected it in the question. – flxflks Nov 18 '21 at 11:18

score 0 · Answer 2 · answered Nov 16 '21 at 14:01

The solution using dplyr with group_by and summarize. The key part is to run u_disastertype = toString(unique(disastertype)), before disastertype = paste(disastertype, collapse = ', '),

naturaldisaster2 <- naturaldisaster %>%
  group_by(year, state) %>%
  summarise(
    u_disastertype = toString(unique(disastertype)),
    disastertype = paste(disastertype, collapse = ', '),
    deaths=sum(deaths),
    injured=sum(injured)
    )

The answer is based on this Stackoverflow answer to a similar question, where only one operation was run on the column whereas I am running two operations on the same column: https://stackoverflow.com/a/46367425/11045110

Summarize by year and create a new variable containing a vector / list of unique values for each row

2 Answers2