1

I am trying to create pairways interactions between each field of a dataset for a glmnet model, without having to name each field individually. However, when it tries to perform this automatically, it gets hung up on creating them for all the variants of one-hot encoded categorical variables against themselves (eg it creates an interaction column between Gender_Male and Gender_Female, and then can't find any values so the entire thing is filled with NaNs) which then makes glmnet throw an error.

Here's some sample code:

library(dplyr)
library(tidyr)
library(rsample)
library(recipes)
library(glmnet)

head(credit_data)

t <- credit_data %>%
  mutate(Status = as.character(Status)) %>%
  mutate(Status = if_else(Status == "good", 1, 0)) %>%
  drop_na()

set.seed(1234)
partitions <- initial_split(t, prop = 9/10, strata = "Status")

parsed_recipe <- recipe(Status ~ ., data = t)  %>%
  step_dummy(one_hot = TRUE, all_predictors(), -all_numeric()) %>%
  step_interact(~.:.) %>% #My attempt to apply the interaction
  step_scale(all_predictors()) %>%
  prep(training = training(partitions))

train_data <- bake(parsed_recipe, new_data = training(partitions))
test_data <- bake(parsed_recipe, new_data = testing(partitions))

fit <- train_data %>%
  select(-Status) %>%
  as.matrix() %>%
  glmnet(x = ., y = train_data$Status, family = "binomial", alpha = 0)

When I run the glmnet section at the end, it gives me this error:

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  NA/NaN/Inf in foreign function call (arg 5)

Having looked at this question, I realised there must be NAs / NaNs in the data, so I ran summary(train_data), which came out looking like this:

enter image description here

So, it's unsurprising that glmnet is upset, but I'm also not sure how to fix it. I really don't want to manually define every single pairing myself. Is there a recipes command to remove potential predictor columns containing NaNs, maybe?

Margaret
  • 5,749
  • 20
  • 56
  • 72

1 Answers1

0

I'm not sure if it's a perfect (or even good) solution, but I used the answer here to find the columns that contained NAs and then removed them wholesale.

So the bit after parsed_recipe was switched to this:

interim_train <- bake(parsed_recipe, new_data = training(partitions))

columns_to_remove <- colnames(interim_train)[colSums(is.na(interim_train)) > 0]

train_data <- interim_train %>%
  select(-columns_to_remove)

summary(train_data)

test_data <- bake(parsed_recipe, new_data = testing(partitions)) %>%
  select(-columns_to_remove)

Thus far it seems to be behaving in a more promising fashion.

Margaret
  • 5,749
  • 20
  • 56
  • 72