0

I have a dataset that looks like this:

ID           | age | disease
smith192     | 17  | lung_cancer
green484     | 12  | diabetes
green484     | 13  | heart_irregularities
tom584       | 12  | colon_cancer
tom584       | 14  | diabetes
tom584       | 15  | malnutrition

And I would like R to organize it into this:

ID           | age_1 | disease_1    | age_2 | disease_2            | age_3 | disease_3    |
smith192     | 17    | lung_cancer  | NA    | NA                   | NA    | NA           |
green484     | 12    | diabetes     | 13    | heart_irregularities | NA    | NA           |
tom584       | 12    | colon_cancer | 14    | diabetes             | 15    | malnutrition |

Any help would be greatly appreciated!

Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
Aaron Z
  • 11
  • 2

1 Answers1

0

You could create disease indices for each ID and then pivot the data to wide.

  • base
df |>
  transform(n = ave(ID, ID, FUN = seq)) |>
  reshape(direction = "wide", idvar = "ID", timevar = "n", v.names = c("age", "disease"))

#         ID age.1    disease.1 age.2            disease.2 age.3    disease.3
# 1 smith192    17  lung_cancer    NA                 <NA>    NA         <NA>
# 2 green484    12     diabetes    13 heart_irregularities    NA         <NA>
# 4   tom584    12 colon_cancer    14             diabetes    15 malnutrition
  • tidyverse
library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>%
  mutate(n = 1:n()) %>%
  ungroup() %>%
  pivot_wider(ID, names_from = n, values_from = c(age, disease))

# # A tibble: 3 × 7
#   ID       age_1 age_2 age_3 disease_1    disease_2            disease_3
#   <chr>    <dbl> <dbl> <dbl> <chr>        <chr>                <chr>
# 1 smith192    17    NA    NA lung_cancer  NA                   NA
# 2 green484    12    13    NA diabetes     heart_irregularities NA
# 3 tom584      12    14    15 colon_cancer diabetes             malnutrition

Data
df <- structure(list(ID = c("smith192", "green484", "green484", "tom584",
"tom584", "tom584"), age = c(17, 12, 13, 12, 14, 15), disease = c("lung_cancer",
"diabetes", "heart_irregularities", "colon_cancer", "diabetes",
"malnutrition")), class = "data.frame", row.names = c(NA, -6L))
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
  • Have you used `values_fn` in cases where there aren't unique `ID` rows? – Karthik S Sep 23 '22 at 04:36
  • @KarthikS It's not needed here. I used `group_by(ID) %>% mutate(n = 1:n())` to create unique indices for each `ID`, so each pair of `age` and `disease` is pivoted to new columns. – Darren Tsai Sep 23 '22 at 05:42
  • Yeah, I used to do the same, but now wondering if creating new column is more efficient than using `values_fn`, is it's all based on individual preference. – Karthik S Sep 23 '22 at 05:45
  • 1
    @KarthikS I like the feature of `values_fn` too, but it is not applicable here (At least I don't know how to). In this case `id_cols` must be `ID` and `values_from` be `c(age, disease)`. If you don't create a new column indicating disease indices, there is nothing put into `names_from`. – Darren Tsai Sep 23 '22 at 05:55