Here's a working solution, based on this answer, which is the one I could most easily get to work with word boundaries.
I'll leave it to you to decide how high you want the number words to go. Note that, as written, I wrap each number word in word boundaries to avoid matching words like "none" or "bitten", which merely contain number words. The downside is that while it will match "twenty one" and "twenty-one", it will not match "twentyone".
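To see what the word boundaries buy you, here is a minimal sketch using only base R's `grepl`. The test strings and the variable names `without_boundary` and `with_boundary` are made up for illustration.

```r
# Without word boundaries, "one" is found inside "none".
without_boundary <- grepl("one", "none", ignore.case = TRUE)
without_boundary
# [1] TRUE

# With \\b on both sides, only the standalone word matches.
# A hyphen or space counts as a boundary; a run-together compound has none.
with_boundary <- grepl("\\bone\\b",
                       c("none", "this one", "twenty-one", "twentyone"),
                       ignore.case = TRUE)
with_boundary
# [1] FALSE  TRUE  TRUE FALSE
```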
I filled out the examples a little bit to illustrate.
detect_arabic_numerals = function(x) grepl("[0-9]", x)
detect_roman_numerals = function(x) {
  x = gsub("\\bI\\b", "", x, ignore.case = TRUE) # Prevent lone I matches
  # Four alternatives, each requiring at least one symbol in a different
  # position (thousands, hundreds, tens, units), so an empty string never matches.
  grepl("\\b(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\\b", x, ignore.case = TRUE)
}
detect_number_words = function(x) {
  number_words = c(
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
    "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
    "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand", "million"
  )
  grepl(paste("\\b", number_words, "\\b", collapse = "|", sep = ""), x, ignore.case = TRUE)
}
detect_numbers = function(x) {
  detect_arabic_numerals(x) | detect_number_words(x) | detect_roman_numerals(x)
}
stuff <- c("Examples of numbers are one and two, 3, 1,284 and fifty nine.",
           "Do you have any lucky numbers?",
           "Roman numerals such as XIII and viii are my favorites.",
           "I also like colors such as blue, green and yellow.",
           "This ice pop costs $1.48.",
           "Extra case none match",
           "But please match this one",
           "Even hyphenated forty-five",
           "Wish to match fortyfive")
stuff[detect_numbers(stuff)]
# [1] "Examples of numbers are one and two, 3, 1,284 and fifty nine."
# [2] "Roman numerals such as XIII and viii are my favorites."
# [3] "This ice pop costs $1.48."
# [4] "But please match this one"
# [5] "Even hyphenated forty-five"
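If you do want run-together compounds like "fortyfive" to match, one option is a second pattern that allows a units word directly after a tens word, with boundaries only around the pair. This is a rough sketch and not part of the functions above; the names `tens`, `units`, `compound_pattern` and `detect_joined_number_words` are hypothetical.

```r
# Assumption: only tens + units compounds (e.g. "fortyfive") are of interest.
tens  <- c("twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
units <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")

# Word boundary, a tens word, a units word glued on, word boundary.
compound_pattern <- paste0(
  "\\b(", paste(tens, collapse = "|"), ")",
  "(", paste(units, collapse = "|"), ")\\b"
)

detect_joined_number_words <- function(x) {
  grepl(compound_pattern, x, ignore.case = TRUE)
}

joined <- detect_joined_number_words(c("Wish to match fortyfive", "Extra case none match"))
joined
# [1]  TRUE FALSE
```

The result could be OR-ed into detect_numbers alongside the other three detectors. Words like "none" still fail to match, because the boundaries sit around the whole compound.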
It's not perfect. The problem I just noticed is that, because punctuation counts as a word boundary, contractions whose suffix is a valid Roman numeral, like "I'm" or "We'd", will match as Roman numerals. You could remove punctuation as a pre-processing step inside detect_roman_numerals, much like I already pre-process to remove the lone "I"s.
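A minimal sketch of that pre-processing step, assuming it is enough to strip straight apostrophes only (narrower than removing all punctuation, which would also glue hyphenated tokens together). The names `roman_pattern` and `detect_roman_numerals_stripped` are hypothetical; the pattern itself is the same one used in detect_roman_numerals, split across lines for readability.

```r
# Same Roman numeral pattern as in detect_roman_numerals.
roman_pattern <- paste0(
  "\\b(",
  "M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|",
  "M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|",
  "M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|",
  "M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3})",
  ")\\b"
)

detect_roman_numerals_stripped <- function(x) {
  # Assumption: dropping apostrophes is enough, e.g. "We'd" becomes "Wed",
  # so the trailing "d" is no longer a standalone token.
  x <- gsub("'", "", x, fixed = TRUE)
  x <- gsub("\\bI\\b", "", x, ignore.case = TRUE) # Prevent lone I matches
  grepl(roman_pattern, x, ignore.case = TRUE)
}

stripped <- detect_roman_numerals_stripped(
  c("We'd like that", "I'm sure", "Roman numerals such as XIII")
)
stripped
# [1] FALSE FALSE  TRUE
```

One trade-off to be aware of: a possessive such as "XIII's" becomes "XIIIs" after stripping and would no longer match, so whether this is an improvement depends on your text.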