I'm new to R and I'm analyzing a dataset with qualitative and quantitative variables. The dataset is this one. I want to perform a ridge regression, so I did this:

library(caret)
library(magrittr)  # provides the %>% pipe used below

set.seed(3)
train_index <- sample(1:nrow(Data1), round(nrow(Data1) * 0.7))

train <- Data1[train_index, ]
nrow(train) / nrow(Data1)

test <- Data1[-train_index, ]
nrow(test) / nrow(Data1)

Then, to convert the qualitative variables to dummy variables:

train_mat <- dummyVars(`Time spent on social media` ~ ., data = train, fullRank = F) %>%
  predict(newdata = train) %>%
  as.matrix()

test_mat <- dummyVars(`Time spent on social media` ~ ., data = test, fullRank = F) %>%
  predict(newdata = test) %>%
  as.matrix()

The problem is that the train and test matrices have different numbers of columns, and I don't understand why.

I thought there might be a problem with the dummy transformation, so I also tried dummy_cols, but nothing changed.
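A minimal sketch of one way this kind of mismatch can arise, using hypothetical toy data (`toy`, `hours`, `platform`) rather than the linked dataset. It assumes the cause is that a category appears in one split but not the other, so an encoder built separately on each split sees different levels. The second half shows one possible remedy: fix the factor levels before splitting, build the encoder once on the training rows, and apply that same encoder to both splits.

```r
library(caret)

# Hypothetical toy data: `platform` has a rare category "C".
toy <- data.frame(
  hours    = c(1, 2, 3, 4, 5, 6),
  platform = c("A", "B", "C", "A", "A", "B"),
  stringsAsFactors = FALSE
)
train <- toy[1:4, ]  # contains A, B and C
test  <- toy[5:6, ]  # contains only A and B

# Levels inferred separately per split: the test split never sees "C".
train_sep <- transform(train, platform = factor(platform))
test_sep  <- transform(test,  platform = factor(platform))

train_sep_mat <- predict(dummyVars(hours ~ ., data = train_sep), newdata = train_sep)
test_sep_mat  <- predict(dummyVars(hours ~ ., data = test_sep),  newdata = test_sep)

# Columns present in one matrix but not the other.
missing_in_test <- setdiff(colnames(train_sep_mat), colnames(test_sep_mat))

# One possible remedy: set the levels on the full data before splitting,
# then fit a single encoder on the training rows and reuse it.
toy$platform <- factor(toy$platform)
train <- toy[1:4, ]
test  <- toy[5:6, ]

encoder   <- dummyVars(hours ~ ., data = train, fullRank = FALSE)
train_mat <- as.matrix(predict(encoder, newdata = train))
test_mat  <- as.matrix(predict(encoder, newdata = test))

print(dim(train_mat))
print(dim(test_mat))
```

Inspecting `missing_in_test` on the real matrices (via `setdiff(colnames(train_mat), colnames(test_mat))`) should show whether absent categories are in fact what differs between them.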

  • 4
    Welcome to SO, Stefano Primiterra! Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Dec 28 '22 at 16:09
  • 2
    I'm going to suggest an edit to your question to remove all of the backslashes, as they are a hindrance/barrier-to-entry for people to try your code (not having sample data is another barrier). In the future, I'm not sure how the backslashes appeared in your code, but please don't add them explicitly. Thanks! (Help for formatting, in case it's useful: https://stackoverflow.com/editing-help and https://meta.stackexchange.com/a/22189) – r2evans Dec 28 '22 at 16:11
  • 1
    (And yes, I see the link to your dataset ... I believe many here on SO prefer to not go to sites for sample data. If you can reproduce your issue with a smaller and in-question dataset, it's often preferred. Thanks!) – r2evans Dec 28 '22 at 16:15
