
I have seen several questions and answers in similar posts on SO (ex. 1, ex. 2, ex. 3), but none seem to really address the problem in the context of tidymodels.

I am trying to use a second-order step_poly function inside a preprocessing recipe to prepare for a KNN model. The sample data is pulled from a Kaggle Playground competition. The training data itself is ~360,000 x 17 with all numeric predictors.

A light preprocessing reprex is:

rec <- recipe(cost ~ ., data = train) |> 
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_predictors()) |> # this line fails??
  step_interact(~ all_predictors():all_predictors())

When going to prep the recipe, prep(rec), an error is thrown:

Error in poly(degree = 2L, x = c(0.871948016751444, 0.871948016751444, : 'degree' must be less than number of unique points

This also persists at tuning time. I understand the rationale behind why the polynomial degree must be less than the number of unique points, but I do not understand where the "unique points" are coming from. Why does my data only have a single unique point? And how can I fix this?

Any and all help is greatly appreciated!

codeweird

1 Answer


You are correct in seeing that the problem comes from not having enough unique values in the columns you are trying to apply step_poly() to.

The default value of degree in step_poly() is 2, so it can only be applied to variables with at least 3 unique values.
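To see the constraint in isolation, here is a minimal sketch that calls stats::poly() directly on made-up vectors (binary_col and three_vals are illustrative, not columns from the Kaggle data). A two-valued column still has only two unique values after step_normalize(), which is why the error appears even though the values in the message look continuous.

```r
# Illustrative vectors only; these are not taken from the competition data.
binary_col <- c(0, 1, 0, 1, 1, 0)  # 2 unique values, like a 0/1 indicator
three_vals <- c(0, 1, 2, 0, 1, 2)  # 3 unique values

# With 2 unique values, degree = 2 is rejected; capture the message
# instead of stopping the script.
binary_result <- tryCatch(
  poly(binary_col, degree = 2),
  error = function(e) conditionMessage(e)
)
binary_result
#> [1] "'degree' must be less than number of unique points"

# With 3 unique values, degree = 2 works and returns one column per degree.
ok_result <- poly(three_vals, degree = 2)
dim(ok_result)
#> [1] 6 2
```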

We can use the function n_distinct() inside a map to find the number of distinct values for each variable.

library(tidymodels)

train <- readr::read_csv("~/Desktop/train.csv.zip")
train <- janitor::clean_names(train)

train |> 
  select(where(is.numeric)) |>
  map_dbl(n_distinct) |>
  sort()
#>        recyclable_package                   low_fat                coffee_bar 
#>                         2                         2                         2 
#>               video_store                 salad_bar             prepared_food 
#>                         2                         2                         2 
#>                   florist avg_cars_at_home_approx_1    unit_sales_in_millions 
#>                         2                         5                         6 
#>            total_children      num_children_at_home                store_sqft 
#>                         6                         6                        20 
#>            units_per_case                      cost              gross_weight 
#>                        36                       328                       384 
#>   store_sales_in_millions                        id 
#>                      1044                    360336

We see that many of them have only 2 distinct values, so you will have to manually specify which variables step_poly() is applied to:

rec <- recipe(cost ~ ., data = train) |> 
  update_role(id, new_role = 'id') |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(cost, gross_weight, store_sales_in_millions) |>
  step_interact(~ all_predictors():all_predictors())

rec |>
  prep()
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:    1
#> predictor: 15
#> id:         1
#> 
#> ── Training information 
#> Training data contained 360336 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Centering and scaling for: store_sales_in_millions, ... | Trained
#> • Orthogonal polynomials on: cost, gross_weight, ... | Trained
#> • Interactions with: (unit_sales_in_millions + total_children +
#>   num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
#>   low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
#>   + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
#>   gross_weight_poly_2 + store_sales_in_millions_poly_1 +
#>   store_sales_in_millions_poly_2):(unit_sales_in_millions + total_children +
#>   num_children_at_home + avg_cars_at_home_approx_1 + recyclable_package +
#>   low_fat + units_per_case + store_sqft + coffee_bar + video_store + salad_bar
#>   + prepared_food + florist + cost_poly_1 + cost_poly_2 + gross_weight_poly_1 +
#>   gross_weight_poly_2 + store_sales_in_millions_poly_1 +
#>   store_sales_in_millions_poly_2) | Trained

Created on 2023-05-01 with reprex v2.0.2
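If listing the columns by hand is inconvenient, one possible alternative is to build the list from the distinct-value counts and pass it to step_poly() with all_of(). The sketch below is a hedged illustration on a small made-up data frame (toy and its values are hypothetical stand-ins for train). It also assumes you want to leave the outcome cost and the id column out of the polynomial expansion, which is a choice of this sketch rather than something the recipe requires.

```r
library(tidymodels)

# Hypothetical toy data standing in for the real training set.
set.seed(1)
toy <- tibble(
  id = 1:50,
  low_fat = rep(c(0, 1), 25),                                  # 2 distinct values
  store_sqft = rep(c(20000, 30000, 40000, 50000, 60000), 10),  # 5 distinct values
  gross_weight = runif(50, 5, 25),
  cost = runif(50, 50, 150)
)

# Keep only predictors with at least 3 distinct values, the minimum
# needed for a degree-2 polynomial.
poly_vars <- toy |>
  select(-id, -cost) |>
  select(where(\(x) n_distinct(x) >= 3)) |>
  names()

rec <- recipe(cost ~ ., data = toy) |>
  update_role(id, new_role = "id") |>
  step_normalize(all_numeric_predictors()) |>
  step_poly(all_of(poly_vars))

prepped <- prep(rec)
baked_names <- names(bake(prepped, new_data = NULL))
baked_names
```

For a higher degree, the threshold in n_distinct(x) >= 3 would need to rise with it (degree + 1 distinct values at minimum).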

EmilHvitfeldt