
Problem description

I am working with a spreadsheet of conflict events in the United States. Each row represents a single event and includes geographic and temporal information. Conflict events tend to occur in 'waves' (relatively tight temporal groupings). I have generated an identifier variable for each of these waves and would like to create a variable that measures the geographic spread of the conflict events over the course of each wave.

I wanted to do this in Excel, but unfortunately I don't have the dynamic array formulae available. Before upgrading to a new version of Excel, I want to see whether it is possible in R. The data are already sorted by region, date, and wave.

Data description

The dataset is structured as follows:

Country     Region     Date       Event     Wave
-------     -------    ------     -------   ------
USA         Vermont    5/1/2017   Strike    Wave 1
USA         Vermont    5/2/2017   Strike    Wave 1
USA         New Hamp.  5/3/2017   Strike    Wave 1
USA         Vermont    5/3/2017   Strike    Wave 1
USA         Maine      5/4/2017   Strike    Wave 1
USA         Washingt.  8/16/2018  Riot      Wave 2
USA         Washingt.  8/18/2018  Riot      Wave 2
USA         Oregon     8/18/2018  Protest   Wave 2
USA         Californ.  8/19/2018  Riot      Wave 2
USA         Nevada     8/20/2018  Protest   Wave 2
USA         Idaho      8/20/2018  Riot      Wave 2

What I want to create

I want to create a variable ("geo_disp") that records the running number of distinct regions that have experienced conflict so far within a given wave. Over the course of a wave, I expect this number to increase, and I would like geo_disp to record that growth.

Note that when two events occur on the same day but in different regions, BOTH rows are recorded with that day's total number of regions.

Here is what I want the data to look like:

Country     Region     Date       Event     Wave    geo_disp
-------     -------    ------     -------   ------  --------
USA         Vermont    5/1/2017   Strike    Wave 1   1
USA         Vermont    5/2/2017   Strike    Wave 1   1
USA         New Hamp.  5/3/2017   Strike    Wave 1   2
USA         Vermont    5/3/2017   Strike    Wave 1   2
USA         Maine      5/4/2017   Strike    Wave 1   3
USA         Washingt.  8/16/2018  Riot      Wave 2   1
USA         Washingt.  8/18/2018  Riot      Wave 2   2
USA         Oregon     8/18/2018  Protest   Wave 2   2
USA         Californ.  8/19/2018  Riot      Wave 2   3
USA         Nevada     8/20/2018  Protest   Wave 2   5
USA         Idaho      8/20/2018  Riot      Wave 2   5

How can I create the geo_disp variable using R?

Thank you in advance - I greatly appreciate it.

Yasha

4 Answers


A dplyr solution that keeps the whole data set.

library(dplyr)

df %>% group_by(Wave) %>% mutate(geo_disp = cumsum(!duplicated(Region)))
#> # A tibble: 11 x 6
#> # Groups:   Wave [2]
#>    Country Region    Date      Event   Wave   geo_disp
#>    <chr>   <chr>     <chr>     <chr>   <chr>     <int>
#>  1 USA     Vermont   5/1/2017  Strike  Wave 1        1
#>  2 USA     Vermont   5/2/2017  Strike  Wave 1        1
#>  3 USA     New Hamp. 5/3/2017  Strike  Wave 1        2
#>  4 USA     Vermont   5/3/2017  Strike  Wave 1        2
#>  5 USA     Maine     5/4/2017  Strike  Wave 1        3
#>  6 USA     Washingt. 8/16/2018 Riot    Wave 2        1
#>  7 USA     Washingt. 8/18/2018 Riot    Wave 2        1
#>  8 USA     Oregon    8/18/2018 Protest Wave 2        2
#>  9 USA     Californ. 8/19/2018 Riot    Wave 2        3
#> 10 USA     Nevada    8/20/2018 Protest Wave 2        4
#> 11 USA     Idaho     8/20/2018 Riot    Wave 2        5
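
One detail worth noting: the desired output in the question gives same-day rows the end-of-day total (both 8/20/2018 rows get 5, and the 8/18/2018 Washingt. row gets 2 because Oregon is hit the same day). A sketch of one way to reproduce that behaviour, assuming the data are already sorted by Wave and Date as the question states:

library(dplyr)

df %>%
  group_by(Wave) %>%
  # running count of distinct regions seen so far in the wave
  mutate(cum_regions = cumsum(!duplicated(Region))) %>%
  # give every row on the same date the end-of-day total
  group_by(Wave, Date) %>%
  mutate(geo_disp = max(cum_regions)) %>%
  ungroup() %>%
  select(-cum_regions)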

Allan Cameron

If you don't mind removing duplicate regions in the same wave, you could try this approach using the tidyverse:

library(tidyverse)

df <- tribble(
  ~Country,   ~Region,     ~Date,       ~Event,     ~Wave,
  'USA',         'Vermont',    '5/1/2017',   'Strike',    'Wave 1',
  'USA',         'Vermont',    '5/2/2017',   'Strike',    'Wave 1',
  'USA',         'New Hamp.',  '5/3/2017',   'Strike',    'Wave 1',
  'USA',         'Vermont',    '5/3/2017',   'Strike',    'Wave 1',
  'USA',         'Maine',      '5/4/2017',   'Strike',    'Wave 1',
  'USA',         'Washingt.',  '8/16/2018',  'Riot',      'Wave 2',
  'USA',         'Washingt.',  '8/18/2018',  'Riot',      'Wave 2',
  'USA',         'Oregon',     '8/18/2018',  'Protest',   'Wave 2',
  'USA',         'Californ.',  '8/19/2018',  'Riot',      'Wave 2',
  'USA',         'Nevada',     '8/20/2018',  'Protest',   'Wave 2',
  'USA',         'Idaho',      '8/20/2018',  'Riot',      'Wave 2'
)

df %>% distinct(Wave, Region, .keep_all = TRUE) %>% group_by(Wave) %>% mutate(geo_disp = 1:n())

Note, dput() is a good way to make data easy to share in R.

> dput(df)
structure(list(Country = c("USA", "USA", "USA", "USA", "USA", 
"USA", "USA", "USA", "USA", "USA", "USA"), Region = c("Vermont", 
"Vermont", "New Hamp.", "Vermont", "Maine", "Washingt.", "Washingt.", 
"Oregon", "Californ.", "Nevada", "Idaho"), Date = c("5/1/2017", 
"5/2/2017", "5/3/2017", "5/3/2017", "5/4/2017", "8/16/2018", 
"8/18/2018", "8/18/2018", "8/19/2018", "8/20/2018", "8/20/2018"
), Event = c("Strike", "Strike", "Strike", "Strike", "Strike", 
"Riot", "Riot", "Protest", "Riot", "Protest", "Riot"), Wave = c("Wave 1", 
"Wave 1", "Wave 1", "Wave 1", "Wave 1", "Wave 2", "Wave 2", "Wave 2", 
"Wave 2", "Wave 2", "Wave 2")), row.names = c(NA, -11L), class = c("tbl_df", 
"tbl", "data.frame"))

bischrob
  • thanks for your help. I am hoping to not remove observations from the dataset. I really appreciate the dput() help - I will use that next time. – Yasha Jun 17 '20 at 20:57

We can use match() after grouping by 'Wave':

library(data.table)
setDT(df)[, geo_disp := match(Region, unique(Region)), Wave]

or with dplyr

library(dplyr)
df %>%
  group_by(Wave) %>%
  mutate(geo_disp = match(Region, unique(Region)))
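
As the comments below point out, match() on its own gives each region a fixed index rather than an accumulating count. A sketch of a cumulative variant, wrapping the same match() call in cummax() (the same idea as the data.table answer further down):

library(dplyr)

df %>%
  group_by(Wave) %>%
  # cummax() turns the per-region index into a running count of distinct regions
  mutate(geo_disp = cummax(match(Region, unique(Region))))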
akrun
  • Hi @akrun - thanks very much. When I run the code that you have supplied, the geo_disp variable doesn't seem to accumulate. My hope is that this variable will track the number of regions that events have occurred in over the course of the wave. – Yasha Jun 17 '20 at 20:56
  • @Yasha Perhaps, it is a different example. Sorry – akrun Jun 17 '20 at 23:43
  • no reason to apologize! I really appreciate your help. – Yasha Jun 18 '20 at 20:38

The previous answers address the problem, but for a cumulative count (which I think is what Yasha is trying to achieve) you could do:

library(data.table)
set.seed(1)
toy_data = data.table(
  region = sample(LETTERS[1:3], 10, replace = TRUE),
  wave = c(rep(1, 5), rep(2, 5))
)
# running count of distinct regions within each wave
toy_data[, count := cummax(match(region, unique(region))), wave]
# > toy_data
#     region wave count
#  1:      A    1     1
#  2:      C    1     2
#  3:      A    1     2
#  4:      B    1     3
#  5:      A    1     3
#  6:      C    2     1
#  7:      C    2     1
#  8:      B    2     2
#  9:      B    2     2
# 10:      C    2     2
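
Applied to the question's data (a sketch, assuming df has been read in as shown in the tribble/dput above), the same idea would be:

library(data.table)

setDT(df)[, geo_disp := cummax(match(Region, unique(Region))), by = Wave]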
patalt
  • thank you for your help. When I run your code, I get this error message: "Error in `:=`(count, cummax(match(region, unique(region)))) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=")." – Yasha Jun 17 '20 at 21:07
  • Sorry @Yasha, I hadn't put `library(data.table)`at the top. Now it should work. – patalt Jun 17 '20 at 21:11
  • yes, that works now! Wow, thank you so much! – Yasha Jun 17 '20 at 21:14