mydata <- data.frame(id = c(1,1,1,2,2,3,4),
                     hobby = c("music", "sports", "science", "science", "lifestyle", 
                               "party", "sports"),
                     x = c(10, 10, 10, 23, 23, 11, 0),
                     y = c(78, 78, 78, 55, 55, 22, 9))

> mydata
  id     hobby  x  y
1  1     music 10 78
2  1    sports 10 78
3  1   science 10 78
4  2   science 23 55
5  2 lifestyle 23 55
6  3     party 11 22
7  4    sports  0  9

I have a data.frame in long format with 5 unique hobbies: music, sports, science, lifestyle, and party. What's a quick way in R to obtain 5 data.frames, one per hobby, each with a 0/1 indicator for whether an id has that hobby?

The reason is that I want to run the following regression model 5 separate times, once for each unique hobby:

glm(y ~ hobby + offset(x), family = "poisson", data = dat_music)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_sports)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_science)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_lifestyle)
glm(y ~ hobby + offset(x), family = "poisson", data = dat_party)

For each hobby, I want to summarize the data where each row corresponds to a unique id.

For dat_music, I want to have:

  id hobby  x  y
1  1     1 10 78
2  2     0 23 55
3  3     0 11 22
4  4     0  0  9

For dat_sports, I want to have:

  id hobby  x  y
1  1     1 10 78
2  2     0 23 55
3  3     0 11 22
4  4     1  0  9

And so forth for the remaining hobbies. Suppose that in reality I have 50k unique hobbies. What's an efficient way to do this in R?

Adrian
  • Are the sample results meant to correspond to the sample input? Why are there 4 rows in each result, and why does the first row get a `1` for both "music" and "sports", but the 4th row is different? – Gregor Thomas Oct 17 '22 at 20:01
  • Also, why do you want to do this? If you have 50k hobbies, do you really need 50k data frames? It's generally more efficient to **not** duplicate data 50k times. Whatever you're trying to do, there may be a more efficient way (one-hot encoding in a single data frame, creating the individual data frames on-demand and not saving them in memory after they have been used... it all depends on why you want to do this.) – Gregor Thomas Oct 17 '22 at 20:04
  • There are 4 rows because there's 1 row per unique `id`. Since only `id=1` has `music` as a hobby, then we assign `id=1` a 1 and 0 for everyone else. – Adrian Oct 17 '22 at 20:04
  • Ah, so it's summarized at the ID level. If a hobby is present in an ID it gets a 1, otherwise it gets a 0? That makes more sense – Gregor Thomas Oct 17 '22 at 20:06
  • I want to do this because I want to run a regression model, one for each unique hobby. For example, `glm(y ~ hobby, family = "poisson")` – Adrian Oct 17 '22 at 20:06
  • @GregorThomas I expanded on the reason behind wanting to do this in the original post. – Adrian Oct 17 '22 at 20:10
  • I'd think it might be more efficient to go to wide format and use, e.g. `glm(y ~ sports, family = "poisson", data = wide_data)`. One data frame with 50k columns will be smaller than 50k data frames with 4 columns. But it's probably not a huge difference. – Gregor Thomas Oct 17 '22 at 20:17
  • @GregorThomas Thanks. I didn't think of that! I posted a question on converting from long to wide here if you would like to take a look: https://stackoverflow.com/questions/74102925/how-to-reshape-data-from-long-to-wide-with-0-1-entries – Adrian Oct 17 '22 at 20:32
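The wide-format idea raised in the comments could look roughly like the sketch below. It uses base R only and is an illustration, not code from the thread: the names `wide_data`, `ind`, and `fits` are made up here, and it assumes `x` and `y` are constant within each `id`, as they are in the sample data. It also assumes every hobby name is a syntactically valid R name; names with spaces or symbols would need backticks in the formula.

```r
mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 4),
  hobby = c("music", "sports", "science", "science", "lifestyle",
            "party", "sports"),
  x     = c(10, 10, 10, 23, 23, 11, 0),
  y     = c(78, 78, 78, 55, 55, 22, 9)
)

# One row per id (assumes x and y do not vary within an id)
wide_data <- unique(mydata[, c("id", "x", "y")])

# One 0/1 column per hobby: 1 if the id lists that hobby, 0 otherwise
hobbies <- unique(mydata$hobby)
ind <- sapply(hobbies, function(h) {
  as.integer(wide_data$id %in% mydata$id[mydata$hobby == h])
})
wide_data <- cbind(wide_data, ind)

# Fit the model from the question once per hobby column,
# building the formula y ~ <hobby> + offset(x) on the fly
fits <- lapply(hobbies, function(h) {
  glm(reformulate(c(h, "offset(x)"), response = "y"),
      family = "poisson", data = wide_data)
})
names(fits) <- hobbies
```

With this layout there is a single data frame in memory, and each model picks out the one indicator column it needs.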

1 Answer

Here's a purrr/dplyr version:

library(dplyr)
library(purrr)

## group the data in advance
mydata = mydata %>% group_by(id, x, y)

hobbies = unique(mydata$hobby)
results = map(
  .x = set_names(hobbies),
  .f = \(hobby_i) mydata %>% 
    summarize(
      hobby = as.integer(hobby_i %in% hobby),
      .groups = "drop"
    )
)
results
# $music
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     0
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $sports
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     0
# 3     3    11    22     0
# 4     4     0     9     1
# 
# $science
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     1
# 2     2    23    55     1
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $lifestyle
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     0
# 2     2    23    55     1
# 3     3    11    22     0
# 4     4     0     9     0
# 
# $party
# # A tibble: 4 × 4
#      id     x     y hobby
#   <dbl> <dbl> <dbl> <int>
# 1     1    10    78     0
# 2     2    23    55     0
# 3     3    11    22     1
# 4     4     0     9     0
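Once a list like `results` exists, the regression from the question can be run over it in one pass. The sketch below is an illustration under assumptions rather than part of the original answer: to stay self-contained it types in only the `music` and `sports` data frames by hand, copied from the output above, and the names `fits` and `coefs` are made up here.

```r
# Two of the per-hobby data frames from the output above, entered by hand
results <- list(
  music  = data.frame(id = 1:4, x = c(10, 23, 11, 0),
                      y = c(78, 55, 22, 9), hobby = c(1L, 0L, 0L, 0L)),
  sports = data.frame(id = 1:4, x = c(10, 23, 11, 0),
                      y = c(78, 55, 22, 9), hobby = c(1L, 0L, 0L, 1L))
)

# Fit the model from the question to each data frame in the list
fits <- lapply(results, function(d) {
  glm(y ~ hobby + offset(x), family = "poisson", data = d)
})

# Collect the coefficients: one row per hobby, one column per term
coefs <- t(sapply(fits, coef))
```

With 50k hobbies it may be preferable to keep only `coefs` (or whatever summary is needed) rather than the full list of fitted model objects, since each `glm` fit stores a copy of its data.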
Gregor Thomas
  • `Error: unexpected input in: " .x = set_names(hobbies), .f = \"` any thoughts on this? – Adrian Oct 17 '22 at 21:26
  • If you're using an R version before 4.1, use `function(hobby_i)` instead of `\(hobby_i)`. R 4.1.0 introduced a shortcut syntax `\(x)` for `function(x)`. – Gregor Thomas Oct 17 '22 at 21:35
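For readers on an R version before 4.1, or without dplyr and purrr installed, the same list can be built with the long-form `function(hobby_i)` syntax and base R. This is a minimal sketch and not the answer's own code: the helper name `ids` is made up here, and it assumes `x` and `y` are constant within each `id`.

```r
mydata <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 4),
  hobby = c("music", "sports", "science", "science", "lifestyle",
            "party", "sports"),
  x     = c(10, 10, 10, 23, 23, 11, 0),
  y     = c(78, 78, 78, 55, 55, 22, 9)
)

# One row per id (assumes x and y do not vary within an id)
ids <- unique(mydata[, c("id", "x", "y")])

hobbies <- unique(mydata$hobby)

# function(hobby_i) is the pre-4.1 spelling of \(hobby_i)
results <- lapply(setNames(hobbies, hobbies), function(hobby_i) {
  has_hobby <- ids$id %in% mydata$id[mydata$hobby == hobby_i]
  data.frame(ids, hobby = as.integer(has_hobby), row.names = NULL)
})

results$sports
```

Each element of `results` has one row per `id`, with `hobby` set to 1 when that `id` lists the hobby and 0 otherwise.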