
I have code that detects various number patterns in a character vector of text lines (numbers in digit form, in word form, with decimals, with dollar signs, etc.). I have stored all these patterns in a variable called "nums". (Don't worry about the errors in my pattern; that is not what I'm focusing on.)

nums <- paste(digiNums, dollaCommaNums, dollaDeciNums, textNums, romaNums, sep = "|")
> nums
[1] "(\\d+)|([\\$£]?\\d{1-3}(,\\d{3})+)|([\\$£]?(\\d+)?\\.\\d+)|Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten|Eleven|Twelve|Thirteen|Fourteen|Fifteen|Sixteen|Seventeen|Eighteen|Nineteen|Twenty|Thirty|Fourty|Fifty|Sixty|Seventy|Eighty|Ninty|Hundred|Thousand|Million|Billion|Trillion|\\b(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\\b"

linesNums <- grep(nums, lines, value = TRUE)

Now I am trying to modify my text so that every number matched by the patterns stored in "nums" is wrapped in highlight markers (<< >>). The end result would be something like this:

#example text:
I am <<twenty>> years old.
I have <<$50.45>> in my pocket.
This tree is <<100,000>> years old.

How do I accomplish this? When I tried using gsub, my end result was:

linesNums <- cat(gsub(nums, "<<\\1>>", linesNums))

I am <<nums>> years old.
I have <<nums>> in my pocket.
This tree is <<nums>> years old.
Kai
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Apr 13 '20 at 18:03
  • The `\\1` part will return the first captured match. Your regular expression seems to have a lot of capture groups that you probably aren't interested in individually capturing. Maybe set those to non-capture groups or make sure you have a group around everything you want to consider as a number? – MrFlick Apr 13 '20 at 18:12
  • Another option is to use `stringr` to help. For example: `stringr::str_replace_all(lines, nums, function(x) {paste0("<<", x, ">>")})` But then you can see there are some problems with your regular expression. – MrFlick Apr 13 '20 at 18:17
  • I ran your code but got the following error: "Error in stri_locate_all_regex(string, pattern, omit_no_match = TRUE, : Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)". Do you know what this means? – Kai Apr 13 '20 at 19:05
  • Oh, you had the interval `{1-3}` in your regular expression, but the stringr engine doesn't recognize that as valid. It should be changed to `{1,3}`, assuming you meant for the match to occur 1-3 times. – MrFlick Apr 13 '20 at 19:07
  • What I was trying to say with {1-3} is that the number should begin with either 1, 2, or 3 digits. So for example: 1,000 (good), 10,000 (good), 100,000 (good), 1000,000 (bad) – Kai Apr 13 '20 at 19:27
  • That's what `{1,3}` means. You used it correctly in all the other cases. I'm surprised that worked with `gsub()` because using a dash is not really a valid regular expression (at least not how it's defined on the `?regexp` help page) – MrFlick Apr 13 '20 at 23:50
  • Thank you, your solution did the trick. – Kai Apr 14 '20 at 17:10
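
The two suggestions from the comments above (use `{1,3}` as the interval, and put one capture group around the whole pattern so that `\\1` refers to the full match) can be sketched as follows. This is a minimal illustration, not the asker's code: `simple_nums` is a hypothetical, much shorter stand-in for the full pattern, and it assumes `perl = TRUE` so that non-capturing groups `(?:...)` are available.

```r
# {1,3} means "one to three repetitions"; {1-3} is not a valid interval.
thousands <- "^\\d{1,3}(,\\d{3})+$"
grepl(thousands, c("1,000", "10,000", "100,000", "1000,000"))
#> [1]  TRUE  TRUE  TRUE FALSE

# Hypothetical simplified pattern (an assumption, not the full regex from the
# question): grouped thousands, decimals, plain digits, and a few number words.
# Inner groups are non-capturing so they do not shift the group numbering.
simple_nums <- "\\$?\\d{1,3}(?:,\\d{3})+|\\$?\\d*\\.\\d+|\\$?\\d+|\\b(?:twenty|fifty|hundred)\\b"

lines <- c("I am twenty years old.",
           "I have $50.45 in my pocket.",
           "This tree is 100,000 years old.")

# Wrap the whole alternation in ONE capture group so \\1 is the full match.
highlighted <- gsub(paste0("(", simple_nums, ")"), "<<\\1>>", lines,
                    ignore.case = TRUE, perl = TRUE)
print(highlighted)
#> [1] "I am <<twenty>> years old."
#> [2] "I have <<$50.45>> in my pocket."
#> [3] "This tree is <<100,000>> years old."
```

Putting the longer alternatives (grouped thousands, decimals) before the plain `\\d+` matters with `perl = TRUE`, because that engine takes the first alternative that matches rather than the longest one.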

1 Answer


You need to perform multiple substitutions. There is a base R version, and an alternative using stringr.

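Note that I had to escape the dollar sign in the matched strings to make this work, because each match is reused as a regex pattern in the substitution step (edited).

Obviously, you still need to work on your regex patterns: the output below contains stray matches such as <<I>> (picked up by the Roman numeral pattern) and <<100>>,<<000>>.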

library(stringr)
nums <- "(\\d+)|([\\$£]?\\d{1-3}(,\\d{3})+)|([\\$£]?(\\d+)?\\.\\d+)|Zero|One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten|Eleven|Twelve|Thirteen|Fourteen|Fifteen|Sixteen|Seventeen|Eighteen|Nineteen|Twenty|Thirty|Fourty|Fifty|Sixty|Seventy|Eighty|Ninty|Hundred|Thousand|Million|Billion|Trillion|\\b(M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))\\b"

lines <- c("I am twenty years old.", 
           "I am Twenty years old.", 
           "I have $50.45 in my pocket.",
           "This tree is 100,000 years old.", 
           "This is fine.", "Not sure if that is, two.")


linesNums <- grep(nums, lines, value = TRUE)
rms <- regmatches(linesNums, gregexpr(nums, linesNums))
rms <- unique(unlist(rms))

# alternative stringr function:
str_replace_all(linesNums, 
                setNames(paste0("<<", rms, ">>"), 
                         gsub("$", "\\$", rms, fixed = TRUE)))
#> [1] "<<I>> am twenty years old."             
#> [2] "<<I>> am <<Twenty>> years old."         
#> [3] "<<I>> have <<$50.45>> in my pocket."    
#> [4] "This tree is <<100>>,<<000>> years old."

# base R function:
multisub <- function(target, output, string) {
    replacement.list <- apply(cbind(target, output), 1, as.list)
    mygsub <- function(l, x) gsub(pattern = l[1], replacement = l[2], x, perl=TRUE)
    Reduce(mygsub, replacement.list, init = string, right = TRUE)
}

multisub(gsub("$", "\\$", rms, fixed = TRUE), paste0("<<", rms, ">>"), linesNums)
#> [1] "<<I>> am twenty years old."             
#> [2] "<<I>> am <<Twenty>> years old."         
#> [3] "<<I>> have <<$50.45>> in my pocket."    
#> [4] "This tree is <<100>>,<<000>> years old."

Created on 2020-04-13 by the reprex package (v0.3.0)
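
As a possible alternative to substituting each unique match separately, base R's `regmatches<-` can replace every match in place, which avoids having to re-escape matched text such as `$`. The sketch below is an illustration under assumptions, not part of the answer above: `simple_nums` is a hypothetical simplified pattern and `wrap` is a helper name introduced here.

```r
lines <- c("I am twenty years old.",
           "I have $50.45 in my pocket.",
           "This tree is 100,000 years old.",
           "This is fine.")

# Hypothetical simplified pattern (an assumption, not the regex from the question).
simple_nums <- "\\$?\\d{1,3}(?:,\\d{3})+|\\$?\\d*\\.\\d+|\\$?\\d+|\\b(?:twenty|fifty|hundred)\\b"

# Locate every match in every line.
m <- gregexpr(simple_nums, lines, ignore.case = TRUE, perl = TRUE)

# Wrap each matched substring. Lines without a match yield character(0),
# which is returned unchanged (paste0 would otherwise produce "<<>>").
wrap <- function(x) if (length(x)) paste0("<<", x, ">>") else x

highlighted <- lines
regmatches(highlighted, m) <- lapply(regmatches(highlighted, m), wrap)
print(highlighted)
```

Because the replacement is positional, the matched text is never interpreted as a pattern, so no escaping step is needed.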

user12728748