In TermDocumentMatrix(), setting removeNumbers = TRUE
removes Arabic numbers from an English corpus. How can I remove both Arabic numbers and Roman numerals (such as "iii", "xiv" and "xiii", in either upper or lower case)?
What custom function can I pass as the removeNumbers
option to accomplish that?
Here is the code I am trying to understand and modify:
library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)
library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)
titles = c("Wuthering Heights", "A Tale of Two Cities",
           "Alice's Adventures in Wonderland", "The Adventures of Sherlock Holmes")
## Read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>%
  mutate(document = row_number())
create_chapters = books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
by_chapter = create_chapters %>%
  group_by(document) %>%
  summarise(text = paste(text, collapse = ' '))
import_corpus = Corpus(VectorSource(by_chapter$text))
no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
import_mat = DocumentTermMatrix(import_corpus,
  control = list(stemming = TRUE,            # create root words
                 stopwords = TRUE,           # remove stop words
                 wordLengths = c(3, Inf),    # cut out small words (under 3 characters)
                 removeNumbers = no_romans,  # take out the numbers
                 removePunctuation = TRUE))  # take out punctuation
Inspecting the terms of the resulting matrix shows that Roman numerals such as "iii" and "xii" are still present:
> st = import_mat$dimnames$Term
> st[grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(st))]
[1] "cli" "iii" "mix" "vii" "viii" "xii" "xiii" "xiv"
[9] "xix" "xvi" "xvii" "xviii" "xxi" "xxii" "xxiii" "xxiv"
[17] "xxix" "xxv" "xxvi" "xxvii" "xxviii" "xxx" "xxxi" "xxxii"
[25] "xxxiii" "xxxiv"
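For comparison, here is a minimal sketch of what I expect to happen, run on a plain character vector rather than inside DocumentTermMatrix(). The `tokens` vector and the `no_numerals` helper are made up for illustration (they are not part of the code above), and the sketch assumes the function receives one token per element. It also shows that the pattern treats ordinary words such as "mix" and "cli" as Roman numerals, which explains why they appear in the list above.

```r
# Same pattern and helper as in the question.
roman_pattern <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
no_romans <- function(s) s[!grepl(roman_pattern, toupper(s))]

# Hypothetical helper: drop Roman numerals and tokens made only of digits.
no_numerals <- function(s) {
  is_roman  <- grepl(roman_pattern, toupper(s))
  is_arabic <- grepl("^[0-9]+$", s)
  s[!(is_roman | is_arabic)]
}

# Hypothetical token vector, not taken from the books.
tokens <- c("chapter", "iii", "XIV", "heathcliff", "mix", "cli", "1801")

kept_romans_removed <- no_romans(tokens)    # Roman numerals dropped, "1801" kept
kept_all_removed    <- no_numerals(tokens)  # Roman and Arabic numbers dropped

print(kept_romans_removed)
print(kept_all_removed)
```

On a plain vector the helper drops the numerals as intended, so the problem seems to lie in how the removeNumbers option uses the function, not in the regular expression itself.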