How to remove words that start with digits from tokens in quanteda? Example words: 21st, 80s, 8th, 5k. The actual words can be completely different, and I don't know them in advance.
I have a data frame with English sentences. I converted it to a corpus using quanteda, then tokenized the corpus and did some cleaning with arguments such as remove_punct, remove_symbols, remove_numbers, etc. However, the remove_numbers argument does not remove words that start with digits. I would like to remove such words, but I don't know their exact form: it can be e.g. 21st, 22nd, etc. Here is a reproducible example:
library("quanteda")
data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)
corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))
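
To make the goal concrete, here is a minimal sketch of the kind of filtering I am after. It assumes that tokens_remove() with valuetype = "regex" is a suitable tool, and that the pattern "^[0-9]" is an acceptable definition of "starts with a digit". The names txt, toks_demo and toks_clean are made up for this illustration, and the input is a shortened version of the sentences above.

```r
library("quanteda")

# Toy input reusing words from the example sentences (hypothetical names)
txt <- c(d1 = "R is free software and 2k comes with ABSOLUTELY NO WARRANTY",
         d2 = "Type license or 21st licence for distribution details")

toks_demo <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)

# Assumption: "starts with a digit" means the token matches the regex ^[0-9].
# tokens_remove() drops every token matching the pattern.
toks_clean <- tokens_remove(toks_demo, pattern = "^[0-9]", valuetype = "regex")

print(toks_clean)
```

If this is a reasonable direction, the same call would presumably be applied to toks before building the dfm, but I have not checked it against my full data.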