How to remove words that start with digits from tokens?

Question

How can I remove words that start with digits from tokens in quanteda? Examples are 21st, 80s, 8th, and 5k, but the words can be anything and I don't know them in advance.

I have a data frame with English sentences. I converted it to a corpus with quanteda, then tokenized the corpus and did some cleaning with remove_punct, remove_symbols, remove_numbers, etc. However, the remove_numbers argument does not delete words that start with digits. I would like to delete such words, but I don't know their exact form in advance; it could be 21st, 22nd, and so on.

library("quanteda")

data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)

corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))

rj-nirbhay · Answer 1 · 2020-05-03T18:53:53.300

This kind of problem comes down to finding the right pattern. Here is a solution that applies gsub to the raw text:

text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications.")

# \\b anchors the match at a word boundary, [0-9]+ takes the leading digits,
# and [a-z]+ takes any number of letters that follow (2k, 80s, 21st, 6th).
text1 <- gsub("\\b[0-9]+[a-z]+\\b", "", text)
text1
# [1] "R is free software and  comes with ABSOLUTELY NO WARRANTY."
# [2] "You are welcome to redistribute it under  certain conditions."
# [3] "Type 'license()' or  'licence()' for distribution details."
# [4] "R is a collaborative  project with many contributors."
# [5] "Type 'contributors()' for more information and"
# [6] "'citation()' on how to cite R or R packages in publications."

For details on regular expressions, see the following question and cheat sheet:

How do I deal with special characters like \^$.?*|+()[{ in my regex?

https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
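
Removing a word with gsub leaves a double space where the word used to be, as the output above shows. A minimal base R sketch that removes the digit-initial words and then collapses the leftover whitespace could look like this; `drop_digit_words` is a hypothetical helper name, not something from the original answer or from quanteda:

```r
# Hypothetical helper (illustration only): drop words that start with
# digits and are followed by letters, then tidy up the whitespace.
drop_digit_words <- function(x) {
  # \\b keeps the match at the start of a word, so "mp3s" is left alone.
  out <- gsub("\\b[0-9]+[A-Za-z]+\\b", "", x, perl = TRUE)
  # Collapse runs of whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", out))
}

cleaned <- drop_digit_words(c(
  "R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
  "Type 'license()' or 21st 'licence()' for distribution details."
))
print(cleaned)
```

Purely numeric words such as 2020 are not touched by this pattern, because it requires at least one letter after the digits.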

Both are good solutions (Nirbhay & Francesco) - prefer the simpler gsub, which can then be passed through original poster's code... — aiatay7n, May 03 '20 at 19:15
I like Nirbhay’s solution too. Though, with mine’s you don’t move away from quanteda which I assumed is what the poster wanted. Please, @nukubiho flag what you consider the right answer so that we can close this thread. — Francesco Grossetti, May 04 '20 at 05:51
I like Francesco's answer more because it uses the `quanteda` function. Adding `pattern = "[0-9]+[a-z]", valuetype = "regex"` argument solves my problem. Thank you both. — nukubiho, May 04 '20 at 07:51

score 2 · Accepted Answer · answered May 03 '20 at 19:09

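You need to delete them explicitly, because remove_numbers = TRUE only removes tokens that consist entirely of numbers. A simple regular expression that looks for digits followed by a word character is enough. In the example below, the lookbehind (?<=\\d{1,5}) looks for a run of 1 to 5 digits, and \\w+ matches the word characters that follow it. You can adjust the two bounds to fine-tune the regular expression. Note that the pattern is not anchored, so it also matches digits that appear in the middle of a token; prefix the pattern with ^ if only tokens that start with digits should be removed.

Here is an example that uses only quanteda and adds an explicit tokens_remove() step.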

library("quanteda")
#> Package version: 2.0.0
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#> 
#>     View

data = data.frame(
  text = c("R is free software and 2k comes with ABSOLUTELY NO WARRANTY.",
           "You are welcome to redistribute it under 80s certain conditions.",
           "Type 'license()' or 21st 'licence()' for distribution details.",
           "R is a collaborative 6th project with many contributors.",
           "Type 'contributors()' for more information and",
           "'citation()' on how to cite R or R packages in publications."),
  stringsAsFactors = FALSE
)

corp = corpus(data, text_field = "text")
toks = tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE,
              remove_separators = TRUE, split_hyphens = TRUE)
toks = tokens_remove(toks, pattern = "(?<=\\d{1,5})\\w+", valuetype = "regex")
dfmat = dfm(toks, tolower = TRUE, stem = TRUE, remove = stopwords("english"))

Created on 2020-05-03 by the reprex package (v0.3.0)
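
To see the difference between an unanchored and an anchored pattern, here is a small base R sketch on a made-up token vector (`toks_chr` and its contents are assumptions for illustration, not data from the question). It uses the simpler pattern "[0-9]+[a-z]" mentioned in the comments and compares it with a version anchored by ^:

```r
# Made-up tokens for illustration; not taken from the question's corpus.
toks_chr <- c("21st", "80s", "8th", "5k", "covid19s", "mp3", "2020", "software")

# Unanchored: digits followed by a letter anywhere in the token.
unanchored <- grepl("[0-9]+[a-z]", toks_chr)

# Anchored: the digits must sit at the very start of the token.
anchored <- grepl("^[0-9]+[a-z]", toks_chr)

print(toks_chr[unanchored])
print(toks_chr[anchored])
```

The unanchored pattern also flags "covid19s", while the anchored one only flags tokens that begin with digits. The anchored string should be usable as the `pattern` argument of tokens_remove() with valuetype = "regex", assuming quanteda's regex matching treats this simple pattern the same way base R does.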
