3

In R, what is the best way to find dots flanked by asterisks and replace them with asterisks?

input:

"AG**...**GG*.*.G.*C.C"

desired output:

"AG*******GG***.G.*C.C"

I tried the following function, but it is not elegant to say the least.

    library(stringr)

    replac <- function(my_string) {

        m <- str_locate_all(my_string, "\\*\\.+\\*")[[1]]

        if (nrow(m) == 0) return(my_string)

        split_s <- unlist(str_split(my_string, "")) 

        for (i in 1:nrow(m)) {
            st <- m[i, 1]
            en <- m[i, 2] 
            split_s[st:en] <- rep("*", length(st:en))
        }

        paste(split_s, collapse = "")
    }
  • I have edited the input string and expected output after @TheForthBird's answer below to make clear that dots not flanked by asterisks should not be changed, and that letters other than "A" and "G" may occur.
Vitor
  • I have updated it. Do you mean like this matching 1+ uppercase characters `(?:[A-Z]+\*+|\G(?!^))\K\.(?=[^*]*\*)` instead? https://regex101.com/r/DPt2y0/1 – The fourth bird Aug 25 '19 at 16:42

2 Answers

3

You might use gsub with perl = TRUE and make use of the \G anchor to assert the position at the end of the previous match.

You could match AG or GG using a character class [AG]G or [A-Z]+ to match 1+ uppercase characters.

In the replacement use *

    (?:[A-Z]+\*+|\G(?!^))\K\.(?=[^*]*\*)

That will match

  • `(?:` Non-capturing group
    • `[A-Z]+\*+` Match 1+ uppercase chars A-Z, then 1+ times `*`
    • `|` Or
    • `\G(?!^)` Assert the position at the end of the previous match, not at the start of the string
  • `)` Close non-capturing group
  • `\K` Forget what has been matched so far
  • `\.` Match a literal dot
  • `(?=` Positive lookahead, assert that what is on the right is
    • `[^*]*\*` Match 0+ times any char except `*`, then match `*`
  • `)` Close lookahead
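
To see the two less common tokens in isolation, here is a minimal sketch on toy strings (the strings are made up for illustration and do not come from the question). `\K` keeps the text before it out of the reported match, and `\G` only matches where the previous match ended:

```r
# \K: "a" must be present, but only "b" is part of the match and gets replaced.
k_demo <- gsub("a\\Kb", "X", "abab", perl = TRUE)

# \G: each match must start where the previous one ended, so the chain
# breaks at "b" and the final "a" is left alone.
g_demo <- gsub("\\Ga", "X", "aaba", perl = TRUE)

print(k_demo)
print(g_demo)
```

In the full pattern, the first alternative finds the opening asterisks once, and `\G` then lets each following dot be matched one at a time without re-matching those asterisks.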


For example:

    gsub("(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)", "*", "AG**...**GG*.*.G.*C.C", perl = TRUE)

Result

    [1] "AG*******GG***.G.*C.C"
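
Because `gsub` is vectorised over its input, the same call can be applied to a whole character vector at once. A minimal sketch, assuming a hypothetical wrapper name `fill_flanked_dots`; the two extra strings are the ones raised in the comments below:

```r
# Hypothetical helper name; the pattern is the one used in this answer.
fill_flanked_dots <- function(x) {
  pattern <- "(?:[A-Z]+\\*+|\\G(?!^))\\K\\.(?=[^*]*\\*)"
  gsub(pattern, "*", x, perl = TRUE)
}

inputs <- c("AG**...**GG*.*.G.*C.C", "AG**...**G.G*.*", "A.G**...**GG*.*")
print(fill_flanked_dots(inputs))
```

Note that this pattern needs at least one uppercase letter before the opening asterisks, so it is worth testing on strings that start with `*` before relying on it.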
The fourth bird
  • Actually, this only works for this specific case. If the string is "AG**...**G.G*.*" or "A.G**...**GG*.*" (when there is an occurrence of a dot not surrounded by asterisks), it fails. – Ghost Aug 23 '19 at 20:12
  • @Ghost The question states `flanked by asterisks` and in the example data the dots are flanked on the left and right. – The fourth bird Aug 23 '19 at 20:15
  • I'm working on interpreting the reasoning of `\G(?!^))\K\.` - from what I've got, `\G` asserts start of last match (so you can skip leading input that's not asterisks and also jump to the next period), `(?!^)` asserts it's not the start of the string (because the periods wouldn't be flanked by asterisks) and the `\K\.` is so you match one period after the other, one at a time, while still maintaining that they're between asterisks. Is this correct? It's my first time seeing `\G` or `\K` - I learned something new! – Nick Reed Aug 23 '19 at 20:16
  • @NickReed the `\G` anchor starts either at the start of the string or at the end of the previous match. Perhaps this is a [helpful page](https://www.rexegg.com/regex-anchors.html#G) or [this page](https://stackoverflow.com/questions/21971701/when-is-g-useful-application-in-a-regex) or [this page](https://www.regular-expressions.info/continue.html). I think it would also help to check the [debugger](https://regex101.com/r/TJWWwc/1/debugger) to see how the matches are found. – The fourth bird Aug 23 '19 at 20:19
  • @Thefourthbird Mate, you are not understanding the point of the question.. that string is an example, the title clearly states "Match dots between asterisks", your code doesn't do that, it only reaches the same output IN THIS PARTICULAR EXAMPLE (as would be just doing gsub("\\.","\\*",)). The code the topic opener pasted is in fact the solution, he is looking for a wrapper (which i'm not sure it exists since you will have to pass the matching times as an argument). – Ghost Aug 23 '19 at 20:28
  • @Ghost The title is `How to find substrings flanked by a specific character and replace with text of the same length in R?` Your code would replace any dot with an asterisk, see https://ideone.com/16nfMZ. The pattern I suggest is based on the input of the OP and the desired output as stated in the question. Note that the pattern can be updated to `(?:[^*]+\*+|\G(?!^))\K\.(?=[^*]*\*)` to extend the match. See https://regex101.com/r/9P30Ok/1. So it does match dots surrounded by asterisks and you don't have to pass an argument for the matching times as that is taken care of by the pattern. – The fourth bird Aug 23 '19 at 20:39
  • @Thefourthbird Your code matches "dots between asterisks" ONLY when there are no other occurrences of dots in the string. Your code ONLY works for THAT string. Forget the example for a second and read the replac function the post opener wrote, your code is NOT doing that. – Ghost Aug 23 '19 at 20:50
  • @Ghost I have added another pattern that would match all dots between asterisks. – The fourth bird Aug 23 '19 at 21:08
  • @Thefourthbird thank you, I've learned a lot from your answer. I've updated the input string and expected output to make clear that letters other than [AG] may occur, and that `.` not flanked by `*` should be kept. Although you had already anticipated that possibility with your broader regex at the end, that regex replaces more dots than it should. For example, dots in `*.G.*` are replaced but they shouldn't be (your 1st regex works in this scenario). Could you take a look please? It seems to me that the non-capturing group should be everything except the sequence `\*\.+\*`. Is this possible? – Vitor Aug 25 '19 at 16:57
  • @VitorRezendedaCostaAguiar Does this pattern work for you? https://regex101.com/r/DPt2y0/1 – The fourth bird Aug 25 '19 at 17:03
  • @Thefourthbird Yes it seems to work. Strings can also begin with `*`, and it seems to work if I modify `[A-Z]+` with `[A-Z*]+`. Is that safe? Thank you so much. I do not fully understand your fancy regex, but you solved my problem and gave me a lot to read on! I will accept your answer, but it would be nice if you updated it with this new regex. Thank you a lot! – Vitor Aug 25 '19 at 17:14
  • @Vitor I have updated the answer and the links. If you add `*` to the character class it will match either a range A-Z or `*`. You could do that if you want to match mixed chars. For example https://regex101.com/r/mbHYMq/1 If that fits your requirement you could add it. – The fourth bird Aug 25 '19 at 17:32
  • @Thefourthbird the broader regex is still matching the dots in `*.G.*` when it shouldn't. That seems to be caused by the term `[^.]*` before `\K`. Is there a need for that term? And what is the reason for it? – Vitor Aug 26 '19 at 16:30
  • I benchmarked my `replac` function and the gsub above, and I am surprised. In my dataset of 40,000 strings, the `replac` function takes 3 secs, while the gsub takes 23 secs. The `replac` function involves a lot of splitting, subsetting, and pasting. I verified that both approaches lead to identical outputs. Why is that? – Vitor Aug 26 '19 at 16:38
  • @Vitor This part `[^.]*\K\.` matches 0+ times not a dot, then it uses `\K` to forget the current match and then only matches a `.`. The `\K` part is there because you don't want to match what is matched before, or else that part will also be replaced. I have added the broader match at the beginning because in the previous comments there were comments about the pattern only matching for the current example. I think using gsub takes longer because the string has to be processed by the regex. Shall I remove the broader match from the answer? – The fourth bird Aug 26 '19 at 17:05
  • @Thefourthbird The regex `(?:[A-Z*]+\*+|\G(?!^))\K\.(?=[^*]*\*)` apparently solves my problem. So you can delete the broader option if you want. Thank you. – Vitor Aug 26 '19 at 18:05
1

Try this code. It's still not wrapped, but at least it is a bit shorter than yours and works for all cases, not only the ones without other occurrences of dots in the string:

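    library(stringr)

    replac_v2 <- function(my_string) {
        b <- my_string  # just a shorter name; expects a single string
        while (TRUE) {
            # start/end of the first remaining "*...*" run, NA if none is left
            loc <- str_locate(b, "\\*\\.+\\*")
            if (is.na(loc[1, "start"])) return(b)
            add <- loc[1, "end"] - loc[1, "start"] + 1
            # overwrite the whole run with the same number of asterisks
            b <- str_replace(b, "\\*\\.+\\*", strrep("*", add))
        }
    }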
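
The same replacement can also be sketched without a loop in base R, by matching only the dot runs with lookarounds and overwriting each run with a string of asterisks of equal length. This is an illustrative alternative, not part of the answer itself, and the function name `replac_v3` is hypothetical:

```r
# Hypothetical base R alternative; no stringr needed.
replac_v3 <- function(x) {
  # Runs of dots with "*" immediately before and immediately after.
  m <- gregexpr("(?<=\\*)\\.+(?=\\*)", x, perl = TRUE)
  # Replace every run with the same number of asterisks.
  regmatches(x, m) <- lapply(regmatches(x, m), function(runs) strrep("*", nchar(runs)))
  x
}

print(replac_v3("AG**...**GG*.*.G.*C.C"))
```

Because the lookarounds are zero-width, an asterisk can close one run and open the next, so `*.*.*` becomes `*****`, which matches what the loop above produces.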
Ghost