I'm trying to make predictions with my testing data using my finalized workflow. But whenever I try using the predict function, it gives me this error:

Error in `step_log()`:
! The following required column is missing from `new_data` in step 'log_79Q8u': shares.

The shares variable is present in my testing dataset.

Do I need to change my recipe and retune my model? This is for my final and I really need to resolve this error, so I would appreciate any advice!

My code for the recipe and the prediction is below:

# recipe
recipe_kc <- recipe(shares ~ ., data = articles_train) %>%
  step_log(shares) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_zv(all_predictors())

# selecting best model
best_workflow <- bt_tuned %>% 
  extract_workflow_set_result("recipe3_bt") %>% 
  select_best(metric = "rmse")

best_workflow

final_workflow <- bt_tuned %>% 
  extract_workflow("recipe3_bt") %>% 
  finalize_workflow(best_workflow)


final_fit <- fit(final_workflow, articles_train)


# using testing data
final_pred <- articles_test %>% 
  select(shares) %>% 
  bind_cols(predict(final_fit, new_data = articles_test)) %>% 
  mutate(
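    shares_log = log(shares),
    .pred_log = .pred,
    # step_log() uses the natural log by default, so back-transform with exp()
    .pred = exp(.pred_log)
  ) %>% 
  select(.pred, shares, shares_log, .pred_log)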
Phil
Chelsea Lu
  • Difficult to help without a reproducible example, but is there a reason why you remove all variables except `shares` by running `select(shares)` before `predict()`? – Phil Mar 16 '23 at 22:10
  • @Phil I just wanted to select the predictor variable. Regardless, even if I just run `predict()` by itself, I still get that error. I can also post an example of my workflow for my model if that helps. – Chelsea Lu Mar 16 '23 at 22:22
  • Oh, I see now. Sorry, never mind my initial question. – Phil Mar 16 '23 at 22:27
  • What would really help is an example of your data and your workflow using `dput()`. – Phil Mar 16 '23 at 22:29
  • more on reproducibility: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – william3031 Mar 17 '23 at 00:34

1 Answer

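You are getting this error because you are transforming the outcome inside the recipe. When you call `predict()` on a workflow, the recipe is applied to `new_data` using only the predictors, so `step_log()` cannot find `shares` even though the column exists in your testing set. It is generally advised that you don't perform simple transformations on the outcome inside the recipe, because it can cause problems like the one you have seen.

Instead, I recommend that you do the transformation before you split your data and drop `step_log(shares)` from the recipe. This way you won't run into problems when the outcome isn't available to transform.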

set.seed(3467)
articles_split <- article %>%
  # Log the outcome before splitting so training and testing data agree
  mutate(shares = log(shares)) %>%
  initial_split()

articles_train <- training(articles_split)
articles_test <- testing(articles_split)
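To show how the rest of the pipeline might look once the outcome is logged up front, here is a minimal sketch rather than a definitive implementation. It rests on several assumptions: the `article` data is simulated (the columns `n_tokens` and `num_imgs` are hypothetical stand-ins), a plain `linear_reg()` replaces the tuned boosted tree model from the question, and the names `shares_log` and `.pred_log` are borrowed from the question's code. The point is only that the recipe no longer touches `shares`, and that predictions come back on the natural-log scale, so `exp()` undoes the transformation.

```r
library(tidymodels)

set.seed(3467)

# Simulated stand-in for the real article data (hypothetical columns)
article <- tibble(
  n_tokens = runif(200, 100, 1000),
  num_imgs = rpois(200, 3),
  shares   = exp(5 + 0.002 * n_tokens + rnorm(200, sd = 0.5))
)

# Log the outcome before splitting
articles_split <- article %>%
  mutate(shares = log(shares)) %>%
  initial_split()

articles_train <- training(articles_split)
articles_test  <- testing(articles_split)

# The recipe only preprocesses predictors; there is no step_log(shares)
recipe_kc <- recipe(shares ~ ., data = articles_train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Assumption: a simple linear model stands in for the finalized workflow
final_fit <- workflow() %>%
  add_recipe(recipe_kc) %>%
  add_model(linear_reg()) %>%
  fit(articles_train)

# Predictions are on the log scale; exp() returns them to the original scale
final_pred <- articles_test %>%
  select(shares_log = shares) %>%
  bind_cols(predict(final_fit, new_data = articles_test)) %>%
  rename(.pred_log = .pred) %>%
  mutate(
    shares = exp(shares_log),
    .pred  = exp(.pred_log)
  )

final_pred
```

With the real data, the same pattern should apply to the finalized workflow from the question: refit it on the logged training set, then call `predict()` on the logged testing set and back-transform as above.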

EmilHvitfeldt