1

I'm trying to reformat my dataset from long to wide format, but while this is one of the most discussed topics, I couldn't find a solution for my case, nor to generalize from methods others have used.

My data is in long format, where each ID has a different number of rows (relative to other IDs.) I want to transform to wide format where each ID has one row, and the data is represented by columns with a suffix that reflects the order each value appears per ID.

To illustrate: long to wide

Notice that the NAs values don't necessarily correspond between the two formats. In the long format, NAs are simply missing from data; but in the wide format, NAs appear where values for that id fall short in filling the number of values other IDs might have for the variable x.

My Data

In real life, my data has more than one variable, and it could come in one of two versions:

Version 1 :: For each ID, values appear at the same row across variables

## reproducible data
set.seed(125)
runs_per_id <- sample(5:9, 4, replace = TRUE)
id <- rep(1:4, times = runs_per_id)

set.seed(300)
is_value <- sample (c(0, 1), size = length(id), replace = TRUE)
x <- is_value
x[which(as.logical(is_value))] <- sample(1:100, size = sum(x))
y <- is_value
y[which(as.logical(is_value))] <- sample(1:100, size = sum(y))
z <- is_value
z[which(as.logical(is_value))] <- sample(1:100, size = sum(z))
d <- as.data.frame(cbind(id, x, y, z))
d[d == 0] <- NA

d

#    id  x  y  z
# 1   1 38 63 61
# 2   1 17 27 76
# 3   1 32 81 89
# 4   1 NA NA NA
# 5   1 75  2 53
# 6   1 NA NA NA
# 7   2 NA NA NA
# 8   2 40 75  4
# 9   2 NA NA NA
# 10  2 NA NA NA
# 11  2 28 47 70
# 12  2 NA NA NA
# 13  2 71 67 33
# 14  3 NA NA NA
# 15  3 95 26 82
# 16  3 NA NA NA
# 17  3 41  7 99
# 18  3 97  8 68
# 19  4 NA NA NA
# 20  4 NA NA NA
# 21  4 93 38 58
# 22  4 NA NA NA
# 23  4 NA NA NA

Version 2 :: For each ID, values don't necessarily appear at the same row across variables

## reproducible data based on generating d from above
set.seed(12)
d2 <- data.frame(replicate(3, sample(0:1,length(id),rep=TRUE)))
d2[d2 != 0] <- sample(1:100, size = sum(d2 != 0))
d2[d2 == 0] <- NA
colnames(d2) <- c("x", "y", "z")
d2 <- as.data.frame(cbind(id, d2))

d2

##    id  x  y  z
## 1   1 18 28  5
## 2   1 85 93 22
## 3   1 55 59 NA
## 4   1 NA NA 67
## 5   1 NA 15 77
## 6   1 58 NA NA
## 7   2 NA  7 NA
## 8   2 NA NA 91
## 9   2 88 14 NA
## 10  2 13 NA NA
## 11  2 32 NA NA
## 12  2 NA 80 71
## 13  2 40 74 69
## 14  3 NA NA NA
## 15  3 96 NA 76
## 16  3 NA NA NA
## 17  3 73 66 NA
## 18  3 52 NA NA
## 19  4 56 12 16
## 20  4 53 NA NA
## 21  4 NA 42 84
## 22  4 39 99 NA
## 23  4 NA 37 NA

The Output I'm looking for

Version 1's data enter image description here

Version 2's data enter image description here

Trying to figure this out

I've used dplyr::spread() and even the new experimental pivot_wider() (inspired by this solution), but couldn't get it to number the occurrences of values along the variable, to be represented in the column names.

Ideally, a single solution would address both data versions I presented. It basically just needs to be agnostic to the number of values each id has in each column, and let the data dictate... I think it's a simple problem, but I just can't wrap my head around this.

Thanks!!!

Emman
  • 3,695
  • 2
  • 20
  • 44
  • 1
    `tidyr::gather`, filter rows with x, y, and z NAs, mutate then spread. see https://stackoverflow.com/questions/38260716/how-to-gather-then-mutate-a-new-column-then-spread-to-wide-format-again – A. Suliman Aug 15 '19 at 11:51

1 Answers1

3

The following is a solution based on @A.Suliman comment.

library(tidyr)
library(dplyr)

d %>% 
  # Combine all values besides id in one column
  gather(key, value, -id) %>% 
  # Filter rows without a value
  filter(!is.na(value)) %>% 
  group_by(id, key) %>% 
  # Create a new key variable numbering the key for each id
  mutate(key_new = paste0(key, seq_len(n()))) %>% 
  ungroup() %>% 
  select(-key) %>% 
  # Spread the data with the new key
  spread(key_new, value)

# A tibble: 4 x 13
#      id    x1    x2    x3    x4    y1    y2    y3    y4    z1    z2    z3    z4
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1    38    17    32    75    63    27    81     2    61    76    89    53
# 2     2    40    28    71    NA    75    47    67    NA     4    70    33    NA
# 3     3    95    41    97    NA    26     7     8    NA    82    99    68    NA
# 4     4    93    NA    NA    NA    38    NA    NA    NA    58    NA    NA    NA

For d2 instead of d it gives:

# A tibble: 4 x 13
#      id    x1    x2    x3    x4    y1    y2    y3    y4    z1    z2    z3    z4
#   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1     1    18    85    55    58    28    93    59    15     5    22    67    77
# 2     2    88    13    32    40     7    14    80    74    91    71    69    NA
# 3     3    96    73    52    NA    66    NA    NA    NA    76    NA    NA    NA
# 4     4    56    53    39    NA    12    42    99    37    16    84    NA    NA
kath
  • 7,624
  • 17
  • 32
  • Awesome, thanks. I was missing the critical part of `mutate(key_new = paste0(key, seq_len(n())))`, and instead hoped that some unseen magic would know to do the numbering automatically without specifically asking for it. – Emman Aug 15 '19 at 12:35