Turning variable values into column names; "duplicate identifiers for rows" in tidyr::spread

Question

I am working with a messy voter file. Consider the following tibble:

library(dplyr)
library(tidyr)
dat <- tibble(
  id = factor(c("A","B","C","D","E")),
  demographic_info1 = round(rnorm(5),2),
  demographic_info2 = round(rnorm(5),2),
  election_1 = c(NA,"GN2016","GN2016","SE2016","GN2008"),
  election_2 = c(NA,"MT2014","GN2012","GN2016","GN2004"),
  election_3 = c(NA,NA,NA,"MT2014","GN2000"),
  election_4 = c(NA,NA,NA,"GN2012",NA),
  election_5 = c(NA,NA,NA,"MT2010",NA),
)

Which looks like:

# A tibble: 5 x 8
      id demographic_info1 demographic_info2 election_1 election_2 election_3 election_4 election_5
  <fctr>             <dbl>             <dbl>      <chr>      <chr>      <chr>      <chr>      <chr>
1      A             -1.50              0.81       <NA>       <NA>       <NA>       <NA>       <NA>
2      B             -1.84             -0.64     GN2016     MT2014       <NA>       <NA>       <NA>
3      C              1.66             -0.10     GN2016     GN2012       <NA>       <NA>       <NA>
4      D              0.91             -0.08     SE2016     GN2016     MT2014     GN2012     MT2010
5      E              0.04             -1.15     GN2008     GN2004     GN2000       <NA>       <NA>

Each id is a unique identifier for a voter.
The two demographic_info columns are filler, just to demonstrate that I want to retain these values when I do my reshaping of the data.

The columns election_1 to election_5 are what I am interested in. The data are structured such that the file includes the most recent 5 elections someone has participated in. election_1 is the most recent, election_5 is the least recent.

Note that person A has never turned out to vote, while person D always does. What I would like to do is turn these columns into a number of variables: SE2016, GN2016, MT2014, GN2012, etc.; that is, all of the values in election_1 to election_5. I would like each of these to be variables that are either TRUE or FALSE for whether or not that person showed up to the polls. I have tried this code:

dat %>% # take data
  gather(election, race, election_1:election_5) %>% # gather by election
  mutate(temp=TRUE) %>% # make new variable that is all TRUE
  select(-election) %>% # drop election variable
  spread(race, temp, fill=FALSE) # spread by this all TRUE variable, fill all NAs as FALSE

However, spread throws the error:

Error: Duplicate identifiers for rows (1, 6, 11, 16, 21), (12, 17, 22), (13, 18, 23), (20, 25)

This is because there are multiple entries for each of the values of the race variable. I've tried to group_by(id) before doing the spread, but the same error is thrown.

I would like the resulting tibble to look like:

# A tibble: 5 x 11
      id demographic_info1 demographic_info2 SE2016 GN2016 MT2014 GN2012 MT2010 GN2008 GN2004 GN2000
  <fctr>             <dbl>             <dbl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>
1      A             -0.91             -0.56  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
2      B              1.24             -1.78  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE
3      C              0.61              0.11  FALSE   TRUE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE
4      D              2.43             -0.53   TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE
5      E             -1.40             -1.23  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE

You need to create a sequence column by group before doing the `spread` — akrun, Aug 19 '17 at 18:31
@akrun does the `id` variable not count as a sequence column? Not sure I follow. — Mark White, Aug 19 '17 at 18:32

TJ Mahr · Accepted Answer · 2017-08-21T16:05:28.773

tidyr provides some syntax for dealing with this problem.

# set up
library(dplyr)
library(tidyr)
dat <- tibble(
  id = factor(c("A","B","C","D","E")),
  demographic_info1 = round(rnorm(5),2),
  demographic_info2 = round(rnorm(5),2),
  election_1 = c(NA,"GN2016","GN2016","SE2016","GN2008"),
  election_2 = c(NA,"MT2014","GN2012","GN2016","GN2004"),
  election_3 = c(NA,NA,NA,"MT2014","GN2000"),
  election_4 = c(NA,NA,NA,"GN2012",NA),
  election_5 = c(NA,NA,NA,"MT2010",NA)
)

What we eventually want is a TRUE or FALSE for every voter (5) x election (8) pairing. When we gather the data into a long format, we only see the voter x election combinations that exist in the dataset.

d_votes <- dat %>%
  gather("variable", "election", election_1:election_5) %>%
  select(-variable) %>%
  mutate(voted = TRUE)
d_votes
#> # A tibble: 25 x 5
#>        id demographic_info1 demographic_info2 election voted
#>    <fctr>             <dbl>             <dbl>    <chr> <lgl>
#>  1      A              0.76             -0.23     <NA>  TRUE
#>  2      B             -0.80              0.08   GN2016  TRUE
#>  3      C             -0.33              1.60   GN2016  TRUE
#>  4      D             -0.50             -1.27   SE2016  TRUE
#>  5      E             -1.03              0.59   GN2008  TRUE
#>  6      A              0.76             -0.23     <NA>  TRUE
#>  7      B             -0.80              0.08   MT2014  TRUE
#>  8      C             -0.33              1.60   GN2012  TRUE
#>  9      D             -0.50             -1.27   GN2016  TRUE
#> 10      E             -1.03              0.59   GN2004  TRUE
#> # ... with 15 more rows

count(d_votes, election)
#> # A tibble: 9 x 2
#>   election     n
#>      <chr> <int>
#> 1   GN2000     1
#> 2   GN2004     1
#> 3   GN2008     1
#> 4   GN2012     2
#> 5   GN2016     3
#> 6   MT2010     1
#> 7   MT2014     2
#> 8   SE2016     1
#> 9     <NA>    13

We need to generate every combination of voter and election. tidyr's expand() function creates all combinations of variables from different columns/vectors of data. (It works like the base function expand.grid(), so the name expand() is evocative.)

d_possible_votes <- d_votes %>%
  expand(nesting(id, demographic_info1, demographic_info2),
         election)
d_possible_votes
#> # A tibble: 40 x 4
#>        id demographic_info1 demographic_info2 election
#>    <fctr>             <dbl>             <dbl>    <chr>
#>  1      A              0.76             -0.23   GN2000
#>  2      A              0.76             -0.23   GN2004
#>  3      A              0.76             -0.23   GN2008
#>  4      A              0.76             -0.23   GN2012
#>  5      A              0.76             -0.23   GN2016
#>  6      A              0.76             -0.23   MT2010
#>  7      A              0.76             -0.23   MT2014
#>  8      A              0.76             -0.23   SE2016
#>  9      B             -0.80              0.08   GN2000
#> 10      B             -0.80              0.08   GN2004
#> # ... with 30 more rows

Note that we now have 8 elections x 5 ids = 40 rows.

We used the nesting() function to treat each (id, demographic_info1, demographic_info2) set/row as a single unit; demographics are nested within ids. Expanding provided all 40 combinations of (id, demographic_info1, demographic_info2) x election.
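
To see what nesting() changes, here is a minimal sketch on a two-voter toy table (the `toy` tibble and its `age` column are made up for illustration and are not part of the question's data). Without nesting(), expand() crosses every column with every other, which pairs voters with demographics that were never observed together.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two voters, one demographic column, one election each
toy <- tibble(
  id = c("A", "B"),
  age = c(30, 40),
  election = c("GN2016", "MT2014")
)

# nesting() keeps only the (id, age) pairs that occur in the data,
# then crosses them with every election: 2 pairs x 2 elections
nested <- expand(toy, nesting(id, age), election)
nrow(nested)

# Without nesting(), every id is crossed with every age and every election:
# 2 ids x 2 ages x 2 elections, including pairs such as (A, 40) that never occurred
crossed <- expand(toy, id, age, election)
nrow(crossed)
```

The nested version has 4 rows and the fully crossed version has 8, which is why the demographics are wrapped in nesting() above.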

If we join the observed votes onto the possible votes, the voted column is populated with TRUE or NA values. tidyr's replace_na() function can correct those NA values.

d_possible_votes <- d_possible_votes %>%
  left_join(d_votes) %>%
  replace_na(list(voted = FALSE))
#> Joining, by = c("id", "demographic_info1", "demographic_info2", "election")
d_possible_votes
#> # A tibble: 40 x 5
#>        id demographic_info1 demographic_info2 election voted
#>    <fctr>             <dbl>             <dbl>    <chr> <lgl>
#>  1      A              0.76             -0.23   GN2000 FALSE
#>  2      A              0.76             -0.23   GN2004 FALSE
#>  3      A              0.76             -0.23   GN2008 FALSE
#>  4      A              0.76             -0.23   GN2012 FALSE
#>  5      A              0.76             -0.23   GN2016 FALSE
#>  6      A              0.76             -0.23   MT2010 FALSE
#>  7      A              0.76             -0.23   MT2014 FALSE
#>  8      A              0.76             -0.23   SE2016 FALSE
#>  9      B             -0.80              0.08   GN2000 FALSE
#> 10      B             -0.80              0.08   GN2004 FALSE
#> # ... with 30 more rows

Now, we can spread out the elections and achieve the desired dataframe.

spread(d_possible_votes, election, voted)
#> # A tibble: 5 x 11
#>       id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
#> * <fctr>             <dbl>             <dbl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>
#> 1      A              0.76             -0.23  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
#> 2      B             -0.80              0.08  FALSE  FALSE  FALSE  FALSE   TRUE  FALSE   TRUE  FALSE
#> 3      C             -0.33              1.60  FALSE  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE
#> 4      D             -0.50             -1.27  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE   TRUE   TRUE
#> 5      E             -1.03              0.59   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE

This pattern of generating combinations of identifiers, joining actual data, and correcting missing values is very common, so much so that tidyr includes a function, complete(), to do all three at once.

d_votes %>%
  complete(nesting(id, demographic_info1, demographic_info2),
           election, fill = list(voted = FALSE)) %>%
  spread(election, voted)
#> # A tibble: 5 x 11
#>       id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
#> * <fctr>             <dbl>             <dbl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>
#> 1      A              0.76             -0.23  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
#> 2      B             -0.80              0.08  FALSE  FALSE  FALSE  FALSE   TRUE  FALSE   TRUE  FALSE
#> 3      C             -0.33              1.60  FALSE  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE
#> 4      D             -0.50             -1.27  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE   TRUE   TRUE
#> 5      E             -1.03              0.59   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE
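
For readers on tidyr 1.0.0 or later, where gather() and spread() are superseded by pivot_longer() and pivot_wider(), the same result can probably be reached along the following lines. This is a sketch rather than part of the original answer: it de-duplicates the long data with distinct() instead of calling complete(), and the `slot` column name and the set.seed() call are assumptions added for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the filler columns are reproducible
dat <- tibble(
  id = factor(c("A", "B", "C", "D", "E")),
  demographic_info1 = round(rnorm(5), 2),
  demographic_info2 = round(rnorm(5), 2),
  election_1 = c(NA, "GN2016", "GN2016", "SE2016", "GN2008"),
  election_2 = c(NA, "MT2014", "GN2012", "GN2016", "GN2004"),
  election_3 = c(NA, NA, NA, "MT2014", "GN2000"),
  election_4 = c(NA, NA, NA, "GN2012", NA),
  election_5 = c(NA, NA, NA, "MT2010", NA)
)

wide <- dat %>%
  # long format: one row per voter x election slot ("slot" is a made-up name)
  pivot_longer(starts_with("election_"),
               names_to = "slot", values_to = "election") %>%
  select(-slot) %>%
  # collapse repeated NA rows so each voter x election pair appears once
  distinct() %>%
  mutate(voted = TRUE) %>%
  # one column per election; pairs that never occurred become FALSE
  pivot_wider(names_from = election, values_from = voted,
              values_fill = list(voted = FALSE)) %>%
  # drop the column generated from the NA placeholder, if one was created
  select(-any_of("NA"))

wide
```

Keeping the NA rows until after the pivot is deliberate: dropping them earlier would remove voter A, who has no recorded elections, from the result entirely.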

Thanks for the great explanation on how `complete` works. I wasn't aware of that function! What are your thoughts on my solution, in comparison? My code feels less elegant, but `microbenchmark` says it runs quicker (about 6 milliseconds quicker, which could mean a lot when I run it on the 1GB file) — Mark White, Aug 21 '17 at 15:57

score 1 · Answer 2 · answered Aug 19 '17 at 18:33

We can group by 'id' to create a sequence variable, because each 'id' appears more than once after gathering, and then remove that variable after the spread:

dat %>%
  gather(election, race, election_1:election_5) %>%
  mutate(temp = TRUE) %>%
  group_by(id) %>%
  mutate(i1 = row_number()) %>%
  select(-election) %>%
  spread(race, temp, fill = FALSE) %>%
  select(-i1)

akrun

Brilliant, thanks. Just to make sure I understand *why* this is working: `i1` provides a unique identifier within each group...? – Mark White Aug 19 '17 at 18:36
@MarkWhite Yes, because there are 5 rows for each 'id', i.e. `dat %>% gather(election, race, election_1:election_5) %>% mutate(temp=TRUE) %>% count(id) %>% .$n` returns `[1] 5 5 5 5 5`. So 'i1' creates the sequence 1, 2, 3, ..., 5 for each 'id', which helps in the spread – akrun Aug 19 '17 at 18:38
On second glance, using this on my actual dataset: The code above creates five rows for every `id`; one row for every `election_*` field. – Mark White Aug 19 '17 at 19:19
I've seen your other responses, too, like this one (https://stackoverflow.com/a/43259735/7903456). It seems like it should work for my purposes, but for some reason I get a new row for every `election_*` field. – Mark White Aug 19 '17 at 21:27

Mark White · Answer 3 · 2017-08-19T22:40:23.260

The issue was that there were duplicate entries for NA values. I solved both the duplicate-identifiers error and the multiple-rows problem from akrun's answer by taking only unique rows, then grouping by id:

dat %>%
  gather(election, race, election_1:election_5) %>%
  mutate(temp=TRUE) %>%
  select(-election) %>%
  unique() %>% # GET RID OF DUPLICATE NA ENTRIES
  group_by(id) %>% 
  spread(race, temp, fill=FALSE) %>%
  select(-`<NA>`)

# A tibble: 5 x 11
# Groups:   id [5]
      id demographic_info1 demographic_info2 GN2000 GN2004 GN2008 GN2012 GN2016 MT2010 MT2014 SE2016
* <fctr>             <dbl>             <dbl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>  <lgl>
1      A             -1.19             -0.94  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
2      B              1.41             -0.62  FALSE  FALSE  FALSE  FALSE   TRUE  FALSE   TRUE  FALSE
3      C             -0.21              1.62  FALSE  FALSE  FALSE   TRUE   TRUE  FALSE  FALSE  FALSE
4      D              1.51              0.09  FALSE  FALSE  FALSE   TRUE   TRUE   TRUE   TRUE   TRUE
5      E              0.65             -2.09   TRUE   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE
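
To see where the duplicate identifiers come from, one can count how often each id/race pair occurs in the gathered data. A minimal sketch, assuming the `dat` tibble from the question (rebuilt here so the block runs on its own; the demographic values are fixed placeholders, not the question's random draws):

```r
library(dplyr)
library(tidyr)

dat <- tibble(
  id = factor(c("A", "B", "C", "D", "E")),
  demographic_info1 = c(0.1, 0.2, 0.3, 0.4, 0.5),  # placeholder values
  demographic_info2 = c(1.1, 1.2, 1.3, 1.4, 1.5),  # placeholder values
  election_1 = c(NA, "GN2016", "GN2016", "SE2016", "GN2008"),
  election_2 = c(NA, "MT2014", "GN2012", "GN2016", "GN2004"),
  election_3 = c(NA, NA, NA, "MT2014", "GN2000"),
  election_4 = c(NA, NA, NA, "GN2012", NA),
  election_5 = c(NA, NA, NA, "MT2010", NA)
)

# Count each id/race pair in the long data and keep only the repeated ones
dupes <- dat %>%
  gather(election, race, election_1:election_5) %>%
  count(id, race) %>%
  filter(n > 1)

dupes
```

Only the NA entries repeat (five times for A, three times each for B and C, twice for E), which matches the four row groups listed in the error message. Voter D has no NA entries and therefore no duplicates.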
