I have read many supervised text classification tutorials and have tried tidytext, quanteda, tm, text2vec, and RTextTools on my data. One puzzle remains unsolved for me: there seems to be no general consensus on when to tokenize the text data, before or after the train-test split. In one Stack Overflow post, some argued that it is even "illegal" to tokenize before you split. With its dfm_match() function, the quanteda package looks like it is designed to do the tokenization after splitting the data. Others recommend doing the split after preprocessing; I have seen nice tutorials by Julia Silge and Emil Hvitfeldt.
It would save me many lines of code to do the preprocessing step before the split. But what are the risks? Data leakage, or something else? Is there any evidence comparing the two approaches in terms of classification performance, validity, etc.?
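For concreteness, here is a minimal sketch of the two orderings I have in mind, using quanteda (the object names are only illustrative):
library("quanteda")

txt <- c(d1 = "some text here", d2 = "more text here", d3 = "other text here")
train_idx <- c(TRUE, TRUE, FALSE)

# Option A: preprocess first, split after -- build one combined dfm, then slice by rows
dfmat_all <- dfm(tokens(txt))
dfmat_train <- dfmat_all[train_idx, ]
dfmat_test <- dfmat_all[!train_idx, ]

# Option B: split first, preprocess each part separately, then align
# the test features to the training features with dfm_match()
dfmat_train2 <- dfm(tokens(txt[train_idx]))
dfmat_test2 <- dfm_match(dfm(tokens(txt[!train_idx])), features = featnames(dfmat_train2))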

- I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro & **NOTE** in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Dec 24 '20 at 11:04
1 Answer
"Illegal split"? Sounds interesting (and possibly fun) but I've never heard of that.
The question is: under what circumstances could it make a difference, and how? Train-test splits partition the documents, and whether you tokenize them before you split or after may be irrelevant, since the documents will still contain the same tokens.
However, once you construct a matrix from those tokens, the order does matter: if each dfm is built after the split, the feature set of your training matrix may differ from that of your test set. To predict on a test set, the test data's features must conform to those of the training matrix. There are several possibilities for handling a mismatch:
- Feature in the training set, but absent from the test set. quanteda.textmodels has a (convenient!) option in `predict()` to make the prediction matrix conform to the training matrix automagically: the test set will have this feature added, but counted as zero. This is justified by considering that the feature added information to the training data, and its absence in the test data can itself be counted as informative (see the sketch just below this list).
- Feature not in the training set, but present in the test set. Most of the time, you would want to ignore this feature altogether. Why? Because there is no information about it in the trained model, so its effect will either be undefined, or due entirely to smoothing.
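Here is a small sketch of both cases using quanteda's dfm_match(); the toy texts here are separate from the worked example below:
library("quanteda")
dfmat_tr <- dfm(tokens(c(tr1 = "a b c")))  # training features: a, b, c
dfmat_te <- dfm(tokens(c(te1 = "a b x")))  # test has "x" but lacks "c"

# Conform the test matrix to the training feature set:
# "c" is added as a zero count, and the unseen "x" is dropped.
dfm_match(dfmat_te, features = featnames(dfmat_tr))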
Getting back to the question: how can it make a difference? The main case I can see is when a document-feature matrix (dfm) is formed from all of the documents and then split on rows, so that some features in the training set are entirely zero (but not zero in the test set). If the split is done in a way that keeps these zero-frequency features, then some supervised methods will smooth them, so that their smoothed values enter the model.
Here's an example from quanteda, using the Naive Bayes classifier, which uses +1 smoothing by default on all features.
Let's form a simple dfm with two classes.
library("quanteda")
## Package version: 2.1.1
txt <- c(
  d1 = "a a b b d",
  d2 = "a a a b b",
  d3 = "a b d d d",
  d4 = "a a b c d"
)
y <- c("black", "black", "white", NA)
train <- c(TRUE, TRUE, TRUE, FALSE)
test <- !train
Now tokenize first, split after. Notice that the feature `c` is zero in all three training-set documents, but it will be present in the training set if the index slicing is done on this combined dfm.
dfmat1 <- tokens(txt) %>%
  dfm()
dfmat1
## Document-feature matrix of: 4 documents, 4 features (25.0% sparse).
## features
## docs a b d c
## d1 2 2 1 0
## d2 3 2 0 0
## d3 1 1 3 0
## d4 2 1 1 1
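As a quick check of the point about index slicing (this line is not part of the original output), slicing the combined dfm by rows keeps all four columns, including the all-zero `c`:
# row-slicing the combined dfm retains every feature;
# "c" is kept even though it is zero in all three training documents
dfmat1[train, ]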
Feature `c` is not included when the slicing is done prior to tokenization and forming the dfm.
dfmat2 <- tokens(txt[train]) %>%
  dfm()
dfmat2
## Document-feature matrix of: 3 documents, 3 features (11.1% sparse).
## features
## docs a b d
## d1 2 2 1
## d2 3 2 0
## d3 1 1 3
The test matrix looks like this, and has more of the "black"-associated `a` than the "white"-associated `d`.
dfmattest <- tokens(txt[test]) %>%
  dfm()
dfmattest
## Document-feature matrix of: 1 document, 4 features (0.0% sparse).
## features
## docs a b c d
## d4 2 1 1 1
Now when we train a model and predict, we see this when `c` is included:
library("quanteda.textmodels")
tmod1 <- textmodel_nb(dfmat1, y)
coef(tmod1)
## black white
## a 0.42857143 0.2222222
## b 0.35714286 0.2222222
## d 0.14285714 0.4444444
## c 0.07142857 0.1111111
predict(tmod1, newdata = dfmattest, force = TRUE, type = "prob")
## black white
## d4 0.5526057 0.4473943
but a slightly different result when it is not:
tmod2 <- textmodel_nb(dfmat2, y[train])
coef(tmod2)
## black white
## a 0.4615385 0.25
## b 0.3846154 0.25
## d 0.1538462 0.50
predict(tmod2, newdata = dfmattest, force = TRUE, type = "prob")
## Warning: 1 feature in newdata not used in prediction.
## black white
## d4 0.6173551 0.3826449
The warning message is telling us that the test-set feature `c` was not used in predicting the outcome, since it was absent from the training set.
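To see where the difference comes from, here is a rough hand computation of the two posteriors from the coefficient tables above, assuming uniform class priors and using the d4 counts (a = 2, b = 1, c = 1, d = 1); it is only intended to reproduce the two predict() results:
# likelihood of d4 under tmod1, where the smoothed "c" contributes a term
lik1 <- c(black = 0.42857143^2 * 0.35714286 * 0.07142857 * 0.14285714,
          white = 0.2222222^2 * 0.2222222 * 0.1111111 * 0.4444444)
lik1 / sum(lik1)  # roughly 0.553 black, 0.447 white

# likelihood of d4 under tmod2, where "c" is simply ignored
lik2 <- c(black = 0.4615385^2 * 0.3846154 * 0.1538462,
          white = 0.25^2 * 0.25 * 0.50)
lik2 / sum(lik2)  # roughly 0.617 black, 0.383 white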
So the question to ask is: do you want the absence of a feature to be treated as informative? For the default multinomial Naive Bayes, the absence can be modelled through smoothing if you split after forming the dfm from all of the documents, or it can be ignored if you split first and create each dfm separately. The answer depends on how you want to treat zeros and what they mean in your problem. It also depends in part on your model, since (for instance) with Bernoulli Naive Bayes a zero is considered informative.
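As a quick illustration of that last point (not part of the original example, and reusing dfmat1 and dfmattest from above), the Bernoulli variant can be requested through the distribution argument of textmodel_nb():
# Bernoulli Naive Bayes models presence/absence, so a feature that is
# absent from every training document still contributes to the class
# likelihoods rather than being ignored
tmod_bern <- textmodel_nb(dfmat1, y, distribution = "Bernoulli")
predict(tmod_bern, newdata = dfmattest, force = TRUE, type = "prob")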

- Thank you very much for your detailed answer. Quanteda is relatively easy to implement and it is super fast. However, I am not sure how to control parameter tuning in quanteda. The nb model performance is also different from the one I got using tm and e1071. – user115916 Sep 17 '20 at 13:43
- Thanks. There's no parameter tuning in Naive Bayes, just the options supplied. See https://stackoverflow.com/questions/54427001/naive-bayes-in-quanteda-vs-caret-wildly-different-results for why e1071 is different. – Ken Benoit Sep 17 '20 at 15:41
- Thanks for your great talk at the "Why R?" conference. I used regularized regression methods to classify my text data. Ridge and lasso models worked very well in predicting my test data set. I also tried a cross-validated elastic net while tuning the lambda and alpha parameters. I used the following code and the model worked very well in finding the bestTune, but the predict function failed. – user115916 Oct 07 '20 at 08:40
- I used the following code: # Fit elastic net regression `registerDoMC(cores=2) set.seed(223)` `cv_glmnet <- train(x = dfmat_train, y = data_train$Include, method="glmnet", family="binomial", traControl=trainControl(method="cv", number=10), parallel=TRUE, tuneLength=50)` `predicted_value.elastic <- predict(cv_glmnet, newx= dfmat_matched, s=lambda.min, type="prob")` Error: **Error in cbind2(1, newx) %*% nbeta: invalid class 'NA' to dup_mMatrix_as_dgeMatrix** – user115916 Oct 07 '20 at 08:55