I'm trying to reformat my dataset from long to wide format, but while this is one of the most discussed topics, I couldn't find a solution for my case, nor to generalize from methods others have used.
My data is in long format, where each ID has a different number of rows (relative to other IDs.) I want to transform to wide format where each ID has one row, and the data is represented by columns with a suffix that reflects the order each value appears per ID.
Notice that the NAs values don't necessarily correspond between the two formats. In the long format, NAs are simply missing from data; but in the wide format, NAs appear where values for that id fall short in filling the number of values other IDs might have for the variable x.
My Data
In real life, my data has more than one variable, and it could come in one of two versions:
Version 1 :: For each ID, values appear at the same row across variables
## reproducible data
set.seed(125)
runs_per_id <- sample(5:9, 4, replace = TRUE)
id <- rep(1:4, times = runs_per_id)
set.seed(300)
is_value <- sample (c(0, 1), size = length(id), replace = TRUE)
x <- is_value
x[which(as.logical(is_value))] <- sample(1:100, size = sum(x))
y <- is_value
y[which(as.logical(is_value))] <- sample(1:100, size = sum(y))
z <- is_value
z[which(as.logical(is_value))] <- sample(1:100, size = sum(z))
d <- as.data.frame(cbind(id, x, y, z))
d[d == 0] <- NA
d
# id x y z
# 1 1 38 63 61
# 2 1 17 27 76
# 3 1 32 81 89
# 4 1 NA NA NA
# 5 1 75 2 53
# 6 1 NA NA NA
# 7 2 NA NA NA
# 8 2 40 75 4
# 9 2 NA NA NA
# 10 2 NA NA NA
# 11 2 28 47 70
# 12 2 NA NA NA
# 13 2 71 67 33
# 14 3 NA NA NA
# 15 3 95 26 82
# 16 3 NA NA NA
# 17 3 41 7 99
# 18 3 97 8 68
# 19 4 NA NA NA
# 20 4 NA NA NA
# 21 4 93 38 58
# 22 4 NA NA NA
# 23 4 NA NA NA
Version 2 :: For each ID, values don't necessarily appear at the same row across variables
## reproducible data based on generating d from above
set.seed(12)
d2 <- data.frame(replicate(3, sample(0:1,length(id),rep=TRUE)))
d2[d2 != 0] <- sample(1:100, size = sum(d2 != 0))
d2[d2 == 0] <- NA
colnames(d2) <- c("x", "y", "z")
d2 <- as.data.frame(cbind(id, d2))
d2
## id x y z
## 1 1 18 28 5
## 2 1 85 93 22
## 3 1 55 59 NA
## 4 1 NA NA 67
## 5 1 NA 15 77
## 6 1 58 NA NA
## 7 2 NA 7 NA
## 8 2 NA NA 91
## 9 2 88 14 NA
## 10 2 13 NA NA
## 11 2 32 NA NA
## 12 2 NA 80 71
## 13 2 40 74 69
## 14 3 NA NA NA
## 15 3 96 NA 76
## 16 3 NA NA NA
## 17 3 73 66 NA
## 18 3 52 NA NA
## 19 4 56 12 16
## 20 4 53 NA NA
## 21 4 NA 42 84
## 22 4 39 99 NA
## 23 4 NA 37 NA
The Output I'm looking for
Trying to figure this out
I've used dplyr::spread()
and even the new experimental pivot_wider()
(inspired by this solution), but couldn't get it to number the occurrences of values along the variable, to be represented in the column names.
Ideally, a single solution would address both data versions I presented. It basically just needs to be agnostic to the number of values each id
has in each column, and let the data dictate... I think it's a simple problem, but I just can't wrap my head around this.
Thanks!!!