I have a dataset called NFL. I tried to run XGBoost on it, but I got this error:

Error in xgb.DMatrix(X_Train, label = labels) : 'data' has class 'character' and length 64617. 'data' accepts either a numeric matrix or a single filename.

The raw dataset is called NFL. I'm trying to use "outcome" as the variable to predict, and I want to make it numeric. The "outcome" variable takes the values "Win", "Tie", and "Loss", and I'm trying to recode them in the dataset as 1, 2, and 3.

Here is the code

NFL <- NFL %>% mutate(id = row_number())
# Divided into two groups: trainSet and validate
trainSet <- train %>% sample_frac(0.7)
validate <- train %>% anti_join(trainSet)

#xg boost    
set.seed(112321)

X_Train <- trainSet %>% select(-outcome) %>% as.matrix()
X_Test <- validate %>% select(-target) %>% as.matrix()
labels <- trainSet$outcome %>% as.matrix()
Train <- xgb.DMatrix(X_Train, label = labels)


xgbModel <- xgboost(data = trainSet, objective = "classification" , 
nrounds = 50, subsample=1, colsample_bytree = 1, max_depth = 10, 
eta=0.2, verbose=FALSE)

xgbPred <- predict(xgbModel, validate)
xgbROC <- evaluate(xgbPred, validate$target)

Can anybody tell me how to fix this? Thank you very much!

Update: I tried to use:

NFL%>% mutate(outcome = ifelse(outcome, c("Win", "Tie", "Loss",1,2,3)))

But the result is all NAs. Here is a screenshot: [image: NA/s]

Vinícius Félix
  • `match(dat$outcome, c("Win", "Tie", "Loss"))` – r2evans Sep 21 '21 at 19:02
  • It comes with a warning: Warning in match(., NFL$outcome, c("Win", "Tie", "Loss")) : NAs introduced by coercion – usersquash003 Sep 21 '21 at 19:08
  • Yeah, if you provide a picture of data, there's not much more you can expect. Please do not post an image of code/data/errors: it breaks screen-readers and it cannot be copied or searched (ref: https://meta.stackoverflow.com/a/285557 and https://xkcd.com/2116/). Please just include the code, console output, or data (e.g., `data.frame(...)` or the output from `dput(head(x))`) directly. – r2evans Sep 21 '21 at 19:17
  • Thank you for your guidance! I edited my question and hopefully it helps! – usersquash003 Sep 21 '21 at 20:01
  • `Error: object 'NFL' not found` – r2evans Sep 21 '21 at 20:02
  • It's an rds file, I'm not sure how to upload to the question... – usersquash003 Sep 21 '21 at 20:06
  • Do we really need your whole dataset to demonstrate a concept? How about pasting the output from `dput(x)` where `x` is something like `head(NFL,20)` (or some collection of rows where we see enough variability to fully demonstrate the results). – r2evans Sep 21 '21 at 20:10
  • See https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for more discussion on the "reproducible question" theme. – r2evans Sep 21 '21 at 20:10
  • I think the whole dataset is not necessary. My only question is to figure out how to convert "win" "tie" "loss" in outcome variable to "1", "2", "3" – usersquash003 Sep 21 '21 at 20:40
  • You've been given two options: `as.numeric(factor(.))` and `match(.)`. BTW, `match` and `factor` are **not** dplyr-verbs; your comment above that says `Warning in match(., NFL$outcome, ..)`, it looks like you tried `NFL %>% match(...)`, which is not correct (and not what was recommended). – r2evans Sep 21 '21 at 21:18

3 Answers

1

I think the general solution is to convert to factors, and then convert to numeric.

As an example

data <- data.frame(outcome = c("Win", "Tie", "Loss"), other_cols = runif(3))
data$outcome <- as.numeric(factor(data$outcome, levels=c("Win", "Tie", "Loss")))
head(data)
#>   outcome other_cols
#> 1       1 0.08823792
#> 2       2 0.98049935
#> 3       3 0.61575916

Created on 2021-09-22 by the reprex package (v2.0.1)
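
Applied to an existing data frame, the same conversion only needs to overwrite the one column, so the other variables stay in place. The following is a minimal sketch, assuming a small made-up data frame `nfl_toy` standing in for the real NFL data (the `points` and `yards` columns are hypothetical):

```r
# Toy stand-in for the NFL data; every column except `outcome` is made up.
nfl_toy <- data.frame(
  outcome = c("Win", "Loss", "Tie", "Win"),
  points  = c(27, 10, 20, 31),
  yards   = c(350, 210, 300, 410),
  stringsAsFactors = FALSE
)

# Overwrite only the `outcome` column. The explicit `levels` fix the mapping
# Win = 1, Tie = 2, Loss = 3; without them the levels would be alphabetical.
nfl_toy$outcome <- as.integer(
  factor(nfl_toy$outcome, levels = c("Win", "Tie", "Loss"))
)

# All three columns are still present, and `outcome` is now integer.
str(nfl_toy)
```

Any value that is not one of the three listed levels (for example a differently capitalised "win") would become NA, so it is worth checking `anyNA(nfl_toy$outcome)` afterwards.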

walter
  • Thank you very much! I found the rest of variables in the dataset are gone after I run the codes. Is there any way to convert "outcome" to numeric while keeping the rest of the variables in the dataset? – usersquash003 Sep 21 '21 at 19:24
  • Reload your original data, don't run walter's first line of code (that creates `data`, since you provided no usable data), then run the `as.numeric` line. Are you saying that that line results in all of your other columns being deleted? – r2evans Sep 21 '21 at 19:58
1

For xgboost, I recommend using the tidymodels packages for preprocessing. You're also more likely to get interpretable/meaningful results if you convert unordered categorical variables to dummy variables (one column per category) rather than a single numeric column (unless the factor is ordered). For example:

library(tidymodels)

rec <- recipe(outcome_variable ~ ., data = train) %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

processed_training_data <- prep(rec) %>% juice()

...will return an updated version of your training data with all categorical predictors converted to dummy variables that can be read by xgboost(). The optional step_normalize() centers and scales the numeric predictor variables.
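
If installing tidymodels is not an option, a similar dummy-variable encoding can be sketched in base R with `model.matrix()`. This is a swapped-in technique rather than the recipe above, and it assumes a made-up data frame `toy` with one hypothetical categorical predictor (`stadium`) and one numeric predictor (`points`):

```r
# Made-up data: `stadium` and `points` are hypothetical predictor columns.
toy <- data.frame(
  outcome = c("Win", "Loss", "Tie", "Win"),
  stadium = c("dome", "open", "open", "dome"),
  points  = c(27, 10, 20, 31),
  stringsAsFactors = TRUE
)

# Expand the predictors into a numeric design matrix. Factor columns become
# 0/1 dummy columns (one level per factor is the reference and gets no column).
# The first column is the intercept, which is dropped here.
X <- model.matrix(outcome ~ ., data = toy)[, -1, drop = FALSE]

colnames(X)   # "stadiumopen" "points"
is.numeric(X) # TRUE
```

Because the result is a plain numeric matrix, it is the kind of input that `xgb.DMatrix()` accepts, unlike the character matrix that `as.matrix()` produces when any column is text.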

huttoncp
  • For detailed examples on using the tidymodels packages to fit/tune xgboost models, I recommend checking out Julia Silge's blog: https://juliasilge.com/blog/xgboost-tune-volleyball/ – huttoncp Sep 21 '21 at 21:11
0

You can recode outcome into a separate numeric variable. Then use the numeric variable in place of the character variable in the xgboost modelling step, and it should run without the error message:

NFL$outcome2[NFL$outcome=="Win"] <- 1
NFL$outcome2[NFL$outcome=="Tie"] <- 2
NFL$outcome2[NFL$outcome=="Loss"] <- 3
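
The same recoding can be written in one step with `match()`, as suggested in the comments on the question. The sketch below assumes a made-up data frame `nfl_toy` (the `points` column is hypothetical). It also shifts the codes to start at 0, on the assumption that the model will use one of xgboost's multiclass objectives such as "multi:softmax", which expect labels 0, 1, ..., num_class - 1:

```r
# Toy stand-in for the NFL data; `points` is a made-up predictor.
nfl_toy <- data.frame(
  outcome = c("Win", "Loss", "Tie", "Win"),
  points  = c(27, 10, 20, 31),
  stringsAsFactors = FALSE
)

classes <- c("Win", "Tie", "Loss")

# match() returns the position of each value in `classes`:
# Win = 1, Tie = 2, Loss = 3, and NA for anything not listed.
nfl_toy$outcome2 <- match(nfl_toy$outcome, classes)
stopifnot(!anyNA(nfl_toy$outcome2))

# Assumption: 0-based labels for a multiclass xgboost objective.
labels <- nfl_toy$outcome2 - 1L

# Keep only numeric predictors so as.matrix() yields a numeric matrix
# rather than the character matrix that triggered the original error.
X_Train <- as.matrix(nfl_toy[, "points", drop = FALSE])
```

Note that `match()` is called on the column itself, not piped from the whole data frame.
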
Dan Tarr