How do I create a dataframe with multiple categorical variables and interaction effects, grouped by ID?

Question

I want to reshape my dataframe so that it has one row per ID, with an indicator column for each level of my categorical variables and for each interaction effect.

This is what the original table looks like.

+----+----------------+---------+
| ID |      Page      |  Click  |
+----+----------------+---------+
|  1 | homepage       | logo    |
|  1 | homepage       | search  |
|  1 | category page  | logo    |
|  1 | category page  | search  |
|  2 | homepage       | logo    |
|  2 | homepage       | search  |
| .. |                |         | 
+----+----------------+---------+

I would like to make it into a table like this.

+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+
| ID | Page_homepage  | Page_categorypage  | Click_logo | Click_search  | homepage:search | categorypage:search  | homepage:logo | categorypage:logo |
+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+
|  1 |              1 |                  1 |          1 |             1 |               1 |                    1 |             1 |                 1 |
|  2 |              1 |                  0 |          1 |             1 |               1 |                    0 |             1 |                 0 |
+----+----------------+--------------------+------------+---------------+-----------------+----------------------+---------------+-------------------+

My objective is to be able to create features with interaction effects to perform a logistic regression. There are outputs associated with each ID, so it's important for me to group the results by ID.

What is the best and simplest way to do this? I don't want to manually do it for all the possible variations. I'm indifferent between using R/Python/SQL to perform this.

related: https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format — jogo, May 22 '19 at 19:19
You may want to look at the formula interaction syntax for R models, which might be a lot easier than manually constructing such a table — Calum You, May 22 '19 at 20:14
*I'm indifferent between using R/Python/SQL to perform this* ... this sounds like you assume SO is a free coding service. Please make an earnest attempt at solution with a specific language and edit post with actual question including errors or undesired results so we can help. See [how to ask](https://stackoverflow.com/help/how-to-ask). — Parfait, May 22 '19 at 20:32
Possible duplicate of [How to reshape data from long to wide format](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) — divibisan, May 23 '19 at 18:50
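
To illustrate the formula-interaction suggestion in the comments: a minimal base R sketch, assuming the goal is a 0/1 indicator per ID rather than a row-level model. `model.matrix()` expands `Page:Click` into one column per combination, and `rowsum()` collapses the rows within each ID. The names `tbl`, `mm`, and `wide` are illustrative, and the data is the six-row sample from the question with the levels written without spaces.

```r
# Sample data from the question.
tbl <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2),
  Page  = c("homepage", "homepage", "categorypage", "categorypage",
            "homepage", "homepage"),
  Click = c("logo", "search", "logo", "search", "logo", "search"),
  stringsAsFactors = TRUE
)

# "0 +" drops the intercept, so every level (and every Page:Click
# combination) gets its own indicator column instead of a reference level.
mm <- cbind(
  model.matrix(~ 0 + Page, data = tbl),
  model.matrix(~ 0 + Click, data = tbl),
  model.matrix(~ 0 + Page:Click, data = tbl)
)

# Sum the row-level indicators within each ID, then flag "seen at least once".
wide <- +(rowsum(mm, group = tbl$ID) > 0)
wide
```

If a model is fit on row-level data instead, a formula such as `y ~ Page * Click` passed to `glm(..., family = binomial)` builds the interaction terms internally; `y` stands for an outcome column that the question does not show.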

score 1 · Answer 1 · answered May 22 '19 at 20:20

One way to go about this is to do the individual variables and the interactions separately, then join them together:

library(tidyverse)
tbl <- tibble(
  ID = c(1, 1, 1, 1, 2, 2),
  Page = c("homepage", "homepage", "categorypage", "categorypage", "homepage", "homepage"),
  Click = c("logo", "search", "logo", "search", "logo", "search")
)
tbl
#> # A tibble: 6 x 3
#>      ID Page         Click 
#>   <dbl> <chr>        <chr> 
#> 1     1 homepage     logo  
#> 2     1 homepage     search
#> 3     1 categorypage logo  
#> 4     1 categorypage search
#> 5     2 homepage     logo  
#> 6     2 homepage     search

tbl %>%
  gather(variable, value, Page, Click) %>%
  transmute(ID, colname = str_c(variable, "_", value), presence = 1) %>%
  distinct() %>% # Individual variables now done, now add interactions
  bind_rows(transmute(tbl, ID, colname = str_c(Page, ":", Click), presence = 1)) %>%
  spread(colname, presence, fill = 0) %>%
  select(ID, matches("Page_"), matches("Click_"), matches(":"))
#> # A tibble: 2 x 9
#>      ID Page_categorypa… Page_homepage Click_logo Click_search
#>   <dbl>            <dbl>         <dbl>      <dbl>        <dbl>
#> 1     1                1             1          1            1
#> 2     2                0             1          1            1
#> # … with 4 more variables: `categorypage:logo` <dbl>,
#> #   `categorypage:search` <dbl>, `homepage:logo` <dbl>,
#> #   `homepage:search` <dbl>

Created on 2019-05-22 by the reprex package (v0.2.1)
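
In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The following is a sketch of the same pipeline with the newer verbs, assuming a recent tidyr (a scalar `values_fill` needs 1.1.0 or later, as far as I know). The sample data is rebuilt so the block runs on its own, and the intermediate names `main_effects`, `interactions`, and `wide` are illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

tbl <- tibble(
  ID = c(1, 1, 1, 1, 2, 2),
  Page = c("homepage", "homepage", "categorypage", "categorypage",
           "homepage", "homepage"),
  Click = c("logo", "search", "logo", "search", "logo", "search")
)

# One row per (ID, variable level) that occurs at least once.
main_effects <- tbl %>%
  pivot_longer(c(Page, Click), names_to = "variable", values_to = "value") %>%
  transmute(ID, colname = str_c(variable, "_", value), presence = 1) %>%
  distinct()

# One row per (ID, Page:Click combination); distinct() guards against
# repeated combinations within an ID, which the sample data does not have.
interactions <- tbl %>%
  transmute(ID, colname = str_c(Page, ":", Click), presence = 1) %>%
  distinct()

# Spread to one row per ID, filling unseen levels with 0.
wide <- bind_rows(main_effects, interactions) %>%
  pivot_wider(names_from = colname, values_from = presence, values_fill = 0)
wide
```

Columns come out in order of first appearance; the `select(ID, matches(...))` step from the answer can be appended if a fixed column order matters.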

score 1 · Answer 2 · answered May 22 '19 at 21:06

Here is another approach. I tried to make as few assumptions as possible about the table's column names and size. The only assumptions are that the table has an id column in the first position and that the remaining columns are of type character, as in your example.


library(dplyr)
library(purrr)

df <- data.frame( id = c(1,1,2,2,2,3,3), page = c("home", "home", "home", "cat", "cat", "cat", "hat"), 
                  click = c("search", "logo", "search", "logo", "search", "banana", "banana") )

# auxiliary function for reshape
indicate <- function(x) {
  as.integer(!is_empty(x))
}

# column list for which we want to create the table
cols <- df %>% select(-id) %>% colnames()

# changing variable levels names
purrr::map(cols, function(colname) {
  df %>% pull(colname) %>% gsub("^", paste0(colname, "_"), .)
}) %>% bind_cols() %>% setNames(cols) %>% bind_cols(df %>% select(id), .) -> df2

# creating indicator column for each variable level
# (joining on id avoids the duplicated id columns that bind_cols would create,
# whose automatic renaming depends on the dplyr version)
purrr::map(cols, function(colname) {
  form.string <- paste("id ~", colname)
  reshape2::dcast(df2, as.formula(form.string), indicate)
}) %>% 
  purrr::reduce(function(x, y) left_join(x, y, by = "id")) -> result

# creating formula for all interactions between variables and joining with the rest of analysis
formula <- paste0("id ~ ", paste(cols, collapse = "+")) %>% as.formula()
df %>% reshape2::dcast(., formula, indicate) %>%
  left_join(., result, by = "id") -> final_results

print(final_results)
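
The same idea can also be sketched in base R without reshape2, assuming (as above) an `id` column plus any number of categorical columns. `table()` cross-tabulates `id` against each prefixed level, and `interaction()` builds the combined label for the interaction columns. The helper name `indicator_table` is made up for this sketch, and the interaction columns use ":" as the separator to match the table in the question.

```r
df <- data.frame(
  id = c(1, 1, 2, 2, 2, 3, 3),
  page = c("home", "home", "home", "cat", "cat", "cat", "hat"),
  click = c("search", "logo", "search", "logo", "search", "banana", "banana")
)

# every column except id is treated as a categorical variable
cols <- setdiff(names(df), "id")

# 0/1 matrix: does this id ever show this label?
indicator_table <- function(id, labels) {
  +(unclass(table(id, labels)) > 0)
}

# one block of indicator columns per variable, e.g. page_home, click_logo
main <- lapply(cols, function(col) {
  indicator_table(df$id, paste0(col, "_", df[[col]]))
})

# one column per observed combination of all variables, e.g. home:search
inter <- indicator_table(df$id, interaction(df[cols], sep = ":", drop = TRUE))

final <- do.call(cbind, c(main, list(inter)))
final
```

The result is an integer matrix with the ids as row names; `data.frame(id = rownames(final), final, check.names = FALSE)` turns it into a data frame while keeping the ":" in the column names.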
