I'm working on a text classification project built entirely within the tidymodels framework. I want to find out whether particular data points are consistently misclassified, so I need to inspect the saved predictions for individual samples. When I resample and call collect_predictions(), I get a table with the predicted and actual label for each data point, but nothing that identifies the data points themselves. One column, .row, may trace back to the original data, but I haven't been able to confirm this.
I've been generating my resampling strategy as follows:
```r
grades_split <- initial_split(tabled_texts2, strata = grade)
grades_train <- training(grades_split)
grades_test <- testing(grades_split)
folds <- vfold_cv(grades_train)
```
Then, after tuning and fitting the model, I generate the resamples object:
```r
fitted_grades <- fit(final_wf, grades_train)
LR_rs <- fit_resamples(
  fitted_grades,
  folds,
  control = control_resamples(save_pred = TRUE)
)
```
Finally, I examine the predictions like this:
```r
predictions <- collect_predictions(LR_rs)
View(predictions)
```
I get a table that looks like this:
id | .pred_4 | .pred_not 4 | .row | .pred_class | grade | .config |
---|---|---|---|---|---|---|
Fold01 | 0.502905 | 0.497095 | 18 | 4 | 4 | Preprocessor1_Model1 |
Fold01 | 0.484647 | 0.515353 | 22 | not 4 | 4 | Preprocessor1_Model1 |
Fold01 | 0.481496 | 0.518504 | 23 | not 4 | 4 | Preprocessor1_Model1 |
Fold01 | 0.492314 | 0.507686 | 40 | not 4 | 4 | Preprocessor1_Model1 |
Fold01 | 0.477215 | 0.522785 | 52 | not 4 | 4 | Preprocessor1_Model1 |
How could I map these values back to the original data?
Here is an analogous reprex. In this example, I would like to see specifically which penguins are being misclassified, not just a .row value that I can't interpret (and which I suspect doesn't map one-to-one back to the original dataset).
```r
library(tidyverse)
library(tidymodels)
library(tidytext)
library(modeldata)
library(naivebayes)
library(discrim)

set.seed(1)
data("penguins")
View(penguins)

nb_spec <- naive_Bayes() %>%
  set_mode('classification') %>%
  set_engine('naivebayes')

fitted_wf <- workflow() %>%
  add_formula(species ~ island + flipper_length_mm) %>%
  add_model(nb_spec) %>%
  fit(penguins)

split <- initial_split(penguins)
train <- training(split)
test <- testing(split)
folds <- vfold_cv(train)

NB_rs <- fit_resamples(
  fitted_wf,
  folds,
  control = control_resamples(save_pred = TRUE)
)

predictions <- collect_predictions(NB_rs)
View(predictions)
```
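To make the goal concrete, here is a minimal sketch of the kind of lookup I'm after. It assumes that .row is the row position within the data frame passed to vfold_cv() (train here, not the full penguins data), which is exactly the part I'd like confirmed. The names nb_wf, train_indexed, species_train, joined and misclassified are only illustrative.

```r
library(tidymodels)
library(discrim)     # provides the naive_Bayes() model specification
library(naivebayes)  # engine used in the reprex

set.seed(1)
data("penguins", package = "modeldata")

split <- initial_split(penguins)
train <- training(split)
folds <- vfold_cv(train)

nb_wf <- workflow() %>%
  add_formula(species ~ island + flipper_length_mm) %>%
  add_model(
    naive_Bayes() %>%
      set_mode("classification") %>%
      set_engine("naivebayes")
  )

NB_rs <- fit_resamples(
  nb_wf,
  folds,
  control = control_resamples(save_pred = TRUE)
)
predictions <- collect_predictions(NB_rs)

# ASSUMPTION: .row indexes rows of `train`, the data given to vfold_cv().
# Give every training row an explicit index, and rename the outcome so it
# can be compared with the outcome stored alongside the predictions.
train_indexed <- train %>%
  mutate(.row = row_number()) %>%
  rename(species_train = species)

joined <- predictions %>%
  left_join(train_indexed, by = ".row")

# Sanity check of the assumption: the outcome saved with each prediction
# should equal the outcome of the training row it was joined to.
all(joined$species == joined$species_train)

# Held-out rows whose predicted class differs from the true class,
# now carrying the original penguin columns (island, sex, and so on).
misclassified <- joined %>%
  filter(as.character(.pred_class) != as.character(species))
```

If the sanity check returned FALSE, the assumption about .row would be wrong and the join would be pairing predictions with the wrong penguins.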