
I'm looking to write a grep function that finds which lines of a text contain a number in ANY format.

[Examples of formats: (156), (1.67), (1,467), ($1,654.00), (one thousand two hundred and sixty), (Two Hundred Six), and Roman numerals such as MCCXXXIV.]

**I am assuming that if "I" is by itself, it is the English word and not the Roman numeral.**

oguz ismail
elm774
  • You should include line 4, "I also like colors..etc" because I will be treated as roman numeral I. Just some suggestion to tighten up the definition. – user2332849 Mar 20 '20 at 03:42
  • `[0-9]` is an easy pattern for the Arabic numerals. It's pretty easy to add in standard number words, though you'll need to bound the list at some point. Roman numerals seem like a job for a model, not regex, because they require context. `I` is a Roman numeral and a word, and there's no way to distinguish between them without some sense of meaning or parts of speech, which isn't a job for regex, unless you're willing to both make some strong assumptions and tolerate a lot of [ambiguous cases](https://www.wordnik.com/lists/words-made-of-roman-numerals) – Gregor Thomas Mar 20 '20 at 03:46
  • And if you do want to go that route, then [here is my suggested duplicate](https://stackoverflow.com/q/267399/903061). – Gregor Thomas Mar 20 '20 at 03:49
  • I see your "I by itself is the word, not the numeral", but what about mixed cases? What about LI: Lithium or 51? xi: Greek letter or Roman numeral? MMM: 3000, or yummy? As long as you're okay with consistently erring one way or the other, then I think you can use my suggested duplicate. – Gregor Thomas Mar 20 '20 at 03:52
  • Thank you! As for the duplicate, are you referring to the most voted answer in that thread? I don't see how it can cover decimals or numbers with a comma in them, unless I'm understanding it wrong. – elm774 Mar 20 '20 at 04:04
  • I mean, combining regex patterns isn't hard. You can use "OR" operators within the regex pattern, or you can run `grepl` once for Arabic numerals, once for number words, and once for Roman numerals and use R's OR operator to combine the results. (The second way is probably less efficient, but simpler to debug.) As to "numbers with commas": since all you want to do is *detect*, not *extract* or *replace*, it shouldn't matter whether a number has commas or decimals or anything else. The presence of any single digit `[0-9]` is sufficient. – Gregor Thomas Mar 20 '20 at 04:22
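
The two ways of combining patterns mentioned in the last comment can be sketched as follows. This is only an illustration: the sample strings and the short word list in `word_pat` are made up for the example and are not part of the question.

```r
# Hypothetical sample lines, for illustration only
x <- c("It costs $1,654.00", "roughly 1.67 metres",
       "no digits here", "two hundred six")

digit_pat <- "[0-9]"                        # any single digit is enough to detect
word_pat  <- "\\b(one|two|six|hundred)\\b"  # deliberately tiny word list

# Option 1: a single pattern using alternation inside the regex
combined <- grepl(paste(digit_pat, word_pat, sep = "|"), x, ignore.case = TRUE)

# Option 2: one grepl call per pattern, combined with R's vectorized OR
separate <- grepl(digit_pat, x) | grepl(word_pat, x, ignore.case = TRUE)

combined
separate
```

Both vectors flag the same lines here. Commas, decimal points and currency symbols need no special handling because the digit pattern only has to find one digit.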

1 Answer


Here's a working solution, based on this answer, which is the one I could most easily get to work with word boundaries.

I'll leave it to you to decide how high you want the number words to go. Note that, as written, I use word boundaries with the number words to prevent matching words like "none" or "bitten", which contain number words. The downside is that while it will match "twenty one" and "twenty-one", it will not match "twentyone".
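
To make the trade-off concrete, here is a small sketch comparing a bounded and an unbounded pattern. The word lists (`words`, `tens`, `units`) are shortened, hypothetical stand-ins, and the `fused` pattern is just one possible way to also catch run-together forms, not something the code below uses.

```r
# Hypothetical mini word list, for illustration only
words <- c("one", "ten")
bounded   <- paste0("\\b", words, "\\b", collapse = "|")  # "\\bone\\b|\\bten\\b"
unbounded <- paste(words, collapse = "|")                 # "one|ten"

x <- c("none", "bitten", "twenty one", "twenty-one", "twentyone")
unbounded_hits <- grepl(unbounded, x, ignore.case = TRUE)
bounded_hits   <- grepl(bounded, x, ignore.case = TRUE)

# Assumed extension: an extra pattern for fused tens + units such as "fortyfive"
tens  <- c("twenty", "forty")
units <- c("one", "five")
fused <- paste0("\\b(", paste(tens, collapse = "|"), ")(",
                paste(units, collapse = "|"), ")\\b")
fused_hit <- grepl(fused, "Wish to match fortyfive", ignore.case = TRUE)
```

Without boundaries every string in `x` matches, including "none" and "bitten". With boundaries only "twenty one" and "twenty-one" match. A pattern like `fused` could be OR-ed in if run-together forms matter for your text.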

I filled out the examples a little bit to illustrate.

detect_arabic_numerals = function(x) grepl("[0-9]", x)
detect_roman_numerals = function(x) {
  x = gsub("\\bI\\b", "", x, ignore.case = TRUE) # Prevent lone I matches
  grepl("\\b(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\\b", x, ignore.case = TRUE)
}

detect_number_words = function(x) {
  number_words = c(
    "one",
    "two", 
    "three",
    "four",
    "five", 
    "six",
    "seven",
    "eight",
    "nine",
    "ten", 
    "eleven",
    "twelve",
    "thirteen",
    "fourteen",
    "fifteen", 
    "sixteen",
    "seventeen",
    "eighteen",
    "nineteen", 
    "twenty",
    "thirty", 
    "forty",
    "fifty",
    "sixty", 
    "seventy",
    "eighty",
    "ninety",
    "hundred",
    "thousand",
    "million"
  )
  grepl(paste("\\b", number_words, "\\b", collapse = "|", sep = ""), x, ignore.case = TRUE)
}
detect_numbers = function(x) {
  detect_arabic_numerals(x) | detect_number_words(x) | detect_roman_numerals(x)
}

stuff<-c("Examples of numbers are one and  two, 3, 1,284 and fifty nine.",
         "Do you have any lucky numbers?",
         "Roman numerals such as XIII and viii are my favorites.", 
         "I also like colors such as blue, green and yellow.",
         "This ice pop costs $1.48.",
         "Extra case none match",
         "But please match this one",
         "Even hyphenated forty-five",
         "Wish to match fortyfive")
stuff[detect_numbers(stuff)]
# [1] "Examples of numbers are one and  two, 3, 1,284 and fifty nine."
# [2] "Roman numerals such as XIII and viii are my favorites."        
# [3] "This ice pop costs $1.48."                                     
# [4] "But please match this one"                                     
# [5] "Even hyphenated forty-five"

It's not perfect. The problem I just noticed is that, because punctuation counts as a word boundary, contractions whose suffix is a valid Roman numeral, like "We'd" (D) or "I'm" (M), will match as Roman numerals. You could potentially remove punctuation as a pre-processing step inside detect_roman_numerals, much like I already pre-process to remove the lone "I"s.
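
A minimal sketch of that pre-processing idea, under the assumption that dropping only straight apostrophes is enough (removing all punctuation would also glue together things like "Louis-XIV"). The names `roman_pattern` and `detect_roman_numerals_v2` are made up for this example, and curly apostrophes are not handled.

```r
# Same Roman numeral pattern as above, stored once for reuse
roman_pattern <- "\\b(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\\b"

detect_roman_numerals_v2 <- function(x) {
  # Assumption: join contractions by deleting straight apostrophes ("We'd" -> "Wed")
  x <- gsub("'", "", x, fixed = TRUE)
  # Prevent lone I matches, as in the original function
  x <- gsub("\\bI\\b", "", x, ignore.case = TRUE)
  grepl(roman_pattern, x, ignore.case = TRUE)
}

contraction_test <- c("We'd like some", "I'm here",
                      "Chapter XIII, it's late", "I also like colors")
roman_hits <- detect_roman_numerals_v2(contraction_test)
```

With the apostrophe removed, "We'd" becomes "Wed" and "I'm" becomes "Im", neither of which is a valid numeral, while "XIII" is still found. Ordinary words that happen to be valid numerals (for example "mix") will still match.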

Gregor Thomas
  • THANK YOU SO MUCH! This approach was brilliant and beyond helpful!! I knew I was going to need to make some sacrifices, especially when it came to the Roman numerals, and I'm fine with that. I think I'll add another line to prevent line II matches because in my text I'm not anticipating too many Roman numerals. – elm774 Mar 20 '20 at 05:23