How to efficiently convert long data.frame to individual data.frames with 0/1

Question

mydata <- data.frame(id = c(1,1,1,2,2,3,4),
                     hobby = c("music", "sports", "science", "science", "lifestyle", 
                               "party", "sports"),
                     x = c(10, 10, 10, 23, 23, 11, 0),
                     y = c(78, 78, 78, 55, 55, 22, 9))

> mydata
  id     hobby  x  y
1  1     music 10 78
2  1    sports 10 78
3  1   science 10 78
4  2   science 23 55
5  2 lifestyle 23 55
6  3     party 11 22
7  4    sports  0  9

I have a data.frame that's in a long format with 5 different unique hobbies: music, sports, science, lifestyle, and party. What's a quick way in R to obtain 5 data.frames, one for each hobby that's populated with 0/1?

The reason for this is that I want to run the following regression model five separate times, once for each unique hobby:

glm(y ~ hobby + offset(x), family = "poisson", data = dat_music)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_sports)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_science)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_lifestyle)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_party)

For each hobby, I want to summarize the data where each row corresponds to a unique id.

For dat_music, I want to have:

  id hobby  x  y
1  1     1 10 78
2  2     0 23 55
3  3     0 11 22
4  4     0  0  9

For dat_sports, I want to have:

  id hobby  x  y
1  1     1 10 78
2  2     0 23 55
3  3     0 11 22
4  4     1  0  9

And so forth. Suppose that in reality I have 50k unique hobbies. What's an efficient way to do this in R?

Are the sample results meant to correspond to the sample input? Why are there 4 rows in each result, and why does the first row get a `1` for both "music" and "sports", but the 4th row is different? — Gregor Thomas, Oct 17 '22 at 20:01
Also, why do you want to do this? If you have 50k hobbies, do you really need 50k data frames? It's generally more efficient to **not** duplicate data 50k times. Whatever you're trying to do, there may be a more efficient way (one-hot encoding in a single data frame, creating the individual data frames on-demand and not saving them in memory after they have been used... it all depends on why you want to do this.) — Gregor Thomas, Oct 17 '22 at 20:04
There are 4 rows because there's 1 row per unique `id`. Since only `id=1` has `music` as a hobby, then we assign `id=1` a 1 and 0 for everyone else. — Adrian, Oct 17 '22 at 20:04
Ah, so it's summarized at the ID level. If a hobby is present in an ID it gets a 1, otherwise it gets a 0? That makes more sense — Gregor Thomas, Oct 17 '22 at 20:06
I want to do this because I want to run a regression model, one for each unique hobby. For example, `glm(y ~ hobby, family = "poisson")` — Adrian, Oct 17 '22 at 20:06
@GregorThomas I expanded on the reason behind wanting to do this in the original post. — Adrian, Oct 17 '22 at 20:10
I'd think it might be more efficient to go to wide format and use, e.g. `glm(y ~ sports, family = "poisson", data = wide_data)`. One data frame with 50k columns will be smaller than 50k data frames with 4 columns. But it's probably not a huge difference. — Gregor Thomas, Oct 17 '22 at 20:17
@GregorThomas Thanks. I didn't think of that! I posted a question on converting from long to wide here if you would like to take a look: https://stackoverflow.com/questions/74102925/how-to-reshape-data-from-long-to-wide-with-0-1-entries — Adrian, Oct 17 '22 at 20:32
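
The wide-format idea from the comments above could look roughly like the base R sketch below. It assumes `x` and `y` are constant within each `id` and that hobby names are valid R variable names. The names `ids`, `ind`, `wide_data` and `fits` are made up for illustration, and the model formula follows the `glm(y ~ sports, ...)` form from the comment rather than the offset model in the question.

```r
mydata <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 4),
  hobby = c("music", "sports", "science", "science", "lifestyle",
            "party", "sports"),
  x = c(10, 10, 10, 23, 23, 11, 0),
  y = c(78, 78, 78, 55, 55, 22, 9)
)

# One row per id (assumes x and y do not vary within an id)
ids <- unique(mydata[, c("id", "x", "y")])

# id-by-hobby counts, converted to 0/1 indicator columns
tab <- table(mydata$id, mydata$hobby)
ind <- as.data.frame(1L * (unclass(tab) > 0))

# Align the indicator rows with ids and bind everything into one wide table
wide_data <- cbind(ids, ind[as.character(ids$id), , drop = FALSE])
rownames(wide_data) <- NULL

# Fit one model per hobby column, building each formula from the column name
hobbies <- unique(as.character(mydata$hobby))
fits <- lapply(setNames(hobbies, hobbies), function(h) {
  glm(reformulate(h, response = "y"), family = "poisson", data = wide_data)
})
```

With 50k hobbies it may be preferable to extract only the coefficients inside the loop instead of keeping every fitted model in memory.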

score 3 · Accepted Answer · answered Oct 17 '22 at 20:16

Here's a purrr/dplyr version:

library(dplyr)
library(purrr)

## group the data in advance so each summarize() call returns one row per id
mydata = mydata %>% group_by(id, x, y)

hobbies = unique(mydata$hobby)
results = map(
  .x = set_names(hobbies),
  .f = \(hobby_i) mydata %>% 
    summarize(
      hobby = as.integer(hobby_i %in% hobby),
      .groups = "drop"
    )
)
results
# $music
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     0
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $sports
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     0
# 3     3    11    22     0
# 4     4     0     9     1
# 
# $science
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     1
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $lifestyle
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     0
# 2     2    23    55     1
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $party
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     0
# 2     2    23    55     0
# 3     3    11    22     1
# 4     4     0     9     0

`Error: unexpected input in: " .x = set_names(hobbies), .f = \"` any thoughts on this? — Adrian, Oct 17 '22 at 21:26
If you're using an R version before 4.1, use `function(hobby_i)` instead of `\(hobby_i)`. R 4.1.0 introduced a shortcut syntax `\(x)` for `function(x)`. — Gregor Thomas, Oct 17 '22 at 21:35
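
For R versions before 4.1, or when the data frames should be created on demand instead of all being stored, a base R sketch along these lines may also work. The helper `make_hobby_df()` is a hypothetical name, not part of the answer above, and it assumes `x` and `y` are constant within each `id`.

```r
mydata <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 4),
  hobby = c("music", "sports", "science", "science", "lifestyle",
            "party", "sports"),
  x = c(10, 10, 10, 23, 23, 11, 0),
  y = c(78, 78, 78, 55, 55, 22, 9)
)

# One row per id, computed once (assumes x and y do not vary within an id)
ids <- unique(mydata[, c("id", "x", "y")])
rownames(ids) <- NULL

# Hypothetical helper: build the 0/1 data frame for a single hobby on demand
make_hobby_df <- function(hobby_i) {
  has_hobby <- mydata$id[mydata$hobby == hobby_i]
  out <- ids
  out$hobby <- as.integer(out$id %in% has_hobby)
  out[, c("id", "hobby", "x", "y")]
}

dat_music <- make_hobby_df("music")
dat_sports <- make_hobby_df("sports")
```

Calling the helper inside the model loop means only one per-hobby data frame exists at a time, which matters more as the number of hobbies grows.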
