I am working with a messy voter file. Consider the following tibble
:
library(dplyr)
library(tidyr)
dat <- tibble(
id = factor(c("A","B","C","D","E")),
demographic_info1 = round(rnorm(5),2),
demographic_info2 = round(rnorm(5),2),
election_1 = c(NA,"GN2016","GN2016","SE2016","GN2008"),
election_2 = c(NA,"MT2014","GN2012","GN2016","GN2004"),
election_3 = c(NA,NA,NA,"MT2014","GN2000"),
election_4 = c(NA,NA,NA,"GN2012",NA),
election_5 = c(NA,NA,NA,"MT2010",NA),
)
Which looks like:
# A tibble: 5 x 8
id demographic_info1 demographic_info2 election_1 election_2 election_3 election_4 election_5
<fctr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 A -1.50 0.81 <NA> <NA> <NA> <NA> <NA>
2 B -1.84 -0.64 GN2016 MT2014 <NA> <NA> <NA>
3 C 1.66 -0.10 GN2016 GN2012 <NA> <NA> <NA>
4 D 0.91 -0.08 SE2016 GN2016 MT2014 GN2012 MT2010
5 E 0.04 -1.15 GN2008 GN2004 GN2000 <NA> <NA>
- Each
id
is a unique identifier for a voter. - The two
demographic_info
columns are filler, just to demonstrate that I want to retain these values when I do my reshaping of the data.
The columns election_1
to election_5
are what I am interested in. The data are structured such that the file includes the most recent 5 elections someone has participated in. election_1
is the most recent, election_5
is the least recent.
Note that person A
has never turned out to vote, while person D
always does. What I would like to do is turn these columns into a number of variables: SE2016
, GN2016
, MT2014
, GN2012
, etc.; that is, all of the values in election_1
to election_5
. I would like each of these to be variables that are either TRUE
or FALSE
for whether or not that person showed up to the polls. I have tried this code:
dat %>% # take data
gather(election, race, election_1:election_5) %>% # gather by election
mutate(temp=TRUE) %>% # make new variable that is all TRUE
select(-election) %>% # drop election variable
spread(race, temp, fill=FALSE) # spread by this all TRUE variable, fill all NAs as FALSE
However, spread
throws the error:
Error: Duplicate identifiers for rows (1, 6, 11, 16, 21), (12, 17, 22), (13, 18, 23), (20, 25)
This is because there are multiple entries for each of the values of the race
variable. I've tried to group_by(id)
before doing the spread
, but the same error is thrown.
I would like the resulting tibble
to look like:
# A tibble: 5 x 11
id demographic_info1 demographic_info2 SE2016 GN2016 MT2014 GN2012 MT2010 GN2008 GN2004 GN2000
<fctr> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
1 A -0.91 -0.56 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 B 1.24 -1.78 FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
3 C 0.61 0.11 FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
4 D 2.43 -0.53 TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
5 E -1.40 -1.23 FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE