In TermDocumentMatrix(), the parameter removeNumbers = TRUE removes Arabic numbers from an English corpus. How can I remove both Arabic numbers and Roman numerals (such as "iii", "xiv" and "xiii", in any letter case)? What custom function can I provide to the removeNumbers parameter to accomplish that?

The code which I am trying to understand and modify:

library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)

library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)

titles = c("Wuthering Heights", "A Tale of Two Cities",
  "Alice's Adventures in Wonderland", "The Adventures of Sherlock Holmes")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>% 
  mutate(document = row_number())

create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

import_corpus = Corpus ( VectorSource (by_chapter$text))

no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]

import_mat = DocumentTermMatrix(import_corpus,
  control = list(stemming = TRUE,           # create root words
                 stopwords = TRUE,          # remove stop words
                 minWordLength = 3,         # cut out small words
                 removeNumbers = no_romans, # take out the numbers
                 removePunctuation = TRUE)) # take out punctuation

The following check shows that Roman numerals such as "iii" and "xii" are still present among the terms:

> st = import_mat$dimnames$Term
> st[grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(st))]
 [1] "cli"    "iii"    "mix"    "vii"    "viii"   "xii"    "xiii"   "xiv"   
 [9] "xix"    "xvi"    "xvii"   "xviii"  "xxi"    "xxii"   "xxiii"  "xxiv"  
[17] "xxix"   "xxv"    "xxvi"   "xxvii"  "xxviii" "xxx"    "xxxi"   "xxxii" 
[25] "xxxiii" "xxxiv"
Tim
  • Do you mean Roman numerals? If so, the solution depends on how thorough you need to be, what sorts of numbers you expect, whether you can expect them to be uppercase, and whether you want to err on the side of keeping or removing questionable candidates. On one end of the spectrum you could borrow a relatively simple [regex pattern for detecting Roman Numerals](https://stackoverflow.com/q/267399/903061); on the other end you may need to do some sort of semantic modeling to determine grammatically how something is used. – Gregor Thomas May 25 '20 at 19:36
  • Yes. The regex way is probably what I am hoping for. – Tim May 25 '20 at 19:43
  • Then I'd propose my link as a suggested duplicate. Do think about whether you want to assume an `i` by itself is a Roman numeral or not. Ditto for other single letters, and "mix". If your Roman numerals are, e.g., often preceded by a word like "Chapter" or "Act", or are being used consistently as bullets so are followed by a `.`, you could probably up your accuracy with some regex adjustments. – Gregor Thomas May 25 '20 at 19:44
  • The link is far from showing how to write a custom function for the parameter in R, though. I am not sure what a custom function is expected to accept and return. I am just hoping to remove all numbers, regardless of their context. I also remove all punctuation. Could you help? – Tim May 25 '20 at 19:46
  • The link provides a strong regex pattern for identifying Roman numerals in text. From there, you could use `stringr::str_replace` or `gsub` to do the actual replacing. – r2evans May 25 '20 at 20:02
  • From the help file you linked yourself, objects of class `TermDocumentMatrix` can be subset just like any other matrix. So why not something like `TDM[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",rownames(TDM),ignore.case = TRUE),]`? – Ian Campbell May 25 '20 at 20:08
  • @IanCampbell Thanks. I have added my code, and am not sure how to modify according to your suggestion. Could you post the complete code? – Tim May 25 '20 at 20:24
  • @r2evans Thanks. I have added my code, and am not sure how to modify according to your suggestion. Could you post the complete code? – Tim May 25 '20 at 20:24
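
The subsetting idea from Ian Campbell's comment can be sketched on a plain matrix. This is only an illustration: `dtm` is a made-up stand-in whose columns are terms, as in a DocumentTermMatrix, and it is an assumption (taken from that comment) that a real `tm` matrix accepts the same `[ , ]` indexing.

```r
# Toy stand-in for a document-term matrix: rows are documents, columns are terms.
# (Hypothetical data; a real tm DocumentTermMatrix is assumed to index the same way.)
dtm <- matrix(
  c(1, 0, 2, 1,
    0, 3, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(Docs = c("1", "2"),
                  Terms = c("chapter", "xiv", "heath", "xxiii"))
)

roman_pattern <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"

# TRUE for every column name that is, as a whole, a Roman numeral (any case).
is_roman <- grepl(roman_pattern, colnames(dtm), ignore.case = TRUE)

# Keep only the non-numeral columns; drop = FALSE preserves the matrix shape.
dtm_clean <- dtm[, !is_roman, drop = FALSE]
colnames(dtm_clean)
```

Because a DocumentTermMatrix stores terms as columns, the filter goes in the column position; for a TermDocumentMatrix it would go in the row position, as in the comment.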

1 Answer

Try these options.

library(tm)
dat <- VCorpus(VectorSource(c("iv. Chapter Four", "I really want to discuss the proper mix of 17 ingredients.", "Nothing to remove here.")))

inspect( DocumentTermMatrix(dat) )
# <<DocumentTermMatrix (documents: 3, terms: 13)>>
# Non-/sparse entries: 13/26
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. iv. mix nothing proper really
#    1       1       0    1     0            0   1   0       0      0      0
#    2       0       1    0     0            1   0   1       0      1      1
#    3       0       0    0     1            0   0   0       1      0      0

One of Gregor's cautions, the word "I", does not appear among the terms (presumably because tm drops tokens shorter than three characters by default), so we won't worry about it for now. Another of Gregor's cautions was the word "mix", which is both a legitimate English word and a valid Roman numeral (MIX = 1009). A basic function to remove tokens that are, as a whole, Roman numerals might be:

no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans)) )
# <<DocumentTermMatrix (documents: 3, terms: 12)>>
# Non-/sparse entries: 12/24
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. iv. nothing proper really remove
#    1       1       0    1     0            0   1       0      0      0      0
#    2       0       1    0     0            1   0       0      1      1      0
#    3       0       0    0     1            0   0       1      0      0      1

That removes "mix" but leaves "iv." because of its trailing period. If you need to remove that as well, then perhaps:

no_romans2 <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans2)) )
# <<DocumentTermMatrix (documents: 3, terms: 11)>>
# Non-/sparse entries: 11/22
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. nothing proper really remove the
#    1       1       0    1     0            0       0      0      0      0   0
#    2       0       1    0     0            1       0      1      1      0   1
#    3       0       0    0     1            0       1      0      0      1   0

(The only difference is adding [.]? near the end of the regex.)
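
To see the effect of that optional period outside of `tm`, here is a small sketch that applies both patterns to a hand-written token vector. The vector `tokens` and the names `strict` and `dotted` are made up for illustration, and the sketch assumes the custom function receives a character vector of tokens and returns the ones to keep, which is how the functions above are written.

```r
strict <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
dotted <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$"

# Hypothetical tokens, similar to what the sample documents produce.
tokens <- c("iv.", "chapter", "four", "mix", "here.")

# Keep every token that does NOT match the pattern.
strict_keep <- tokens[!grepl(strict, toupper(tokens))]
dotted_keep <- tokens[!grepl(dotted, toupper(tokens))]

strict_keep  # "iv." "chapter" "four" "here."
dotted_keep  # "chapter" "four" "here."
```

Note that every group in both patterns is optional, so they also match an empty string; that is harmless here because an empty token is not a useful term anyway.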

(BTW: one can use grepl(..., ignore.case=TRUE) to get the same effect as toupper(s) as used here. It is a little slower in small-sample testing, but the effect is the same.)
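
A sketch of the `ignore.case = TRUE` variant, extended to drop all-digit tokens as well (the same idea as the `no_romans3` one-liner in the comments). The name `no_numbers` is made up here, and the sample tokens are hypothetical.

```r
# Hypothetical helper: takes a character vector of tokens and returns
# only those that are neither Roman numerals nor plain Arabic numbers.
no_numbers <- function(s) {
  roman <- grepl(
    "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$",
    s, ignore.case = TRUE
  )
  arabic <- grepl("^[0-9]+$", s)
  s[!(roman | arabic)]
}

kept <- no_numbers(c("xiv", "XIV", "17", "1859", "heights", "mix", "b221"))
kept  # "heights" "b221"
```

Tokens that merely contain digits, such as "b221", are kept because the Arabic pattern is anchored at both ends.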

r2evans
  • Thanks. Is your code only to remove Roman numerals? I am hoping to remove Arabic numbers too. – Tim May 25 '20 at 22:07
  • I didn't think it was important, since the example (that had an explicit `17` in a string) already had it filtered out for some reason. I'm not a `tm` user, so ... *shrug*. Change it to `no_romans3 <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$", toupper(s)) & !grepl("^[0-9]+$", s) ]`. – r2evans May 25 '20 at 22:15
  • Thank you again. I added the code that I am trying to understand and modify. It looks like Roman numerals still exist after applying `no_romans`. – Tim May 25 '20 at 23:52
  • I have shrunk the code to illustrate the problem more clearly. – Tim May 26 '20 at 02:34
  • I might be unable to help much more, Newbie. I can help with the basic R stuff, but I'm not a `tm` user so cannot speak to why it appears to be ignoring the function that otherwise works with sample data. – r2evans May 26 '20 at 03:39