I have a vector

test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")

and I want to replace every leading N in each element with a "-", so that the string keeps its length. When I use gsub, the whole run of leading N's is replaced with a single "-".

gsub("^N+", "-", test)
# [1] "-CTCGTNNNGTCGTNN" "-CGTNNNGTCGTGN"  

But I want the result to look like this:

# "---CTCGTNNNGTCGTNN", "-----CGTNNNGTCGTGN"

Is there any R function that can do this? Thanks for your patience and advice.

Chao Tang

2 Answers

You can write:

test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN", "XNNNNNCGTNNNGTCGTGN")

gsub("\\GN", "-", test, perl = TRUE)

which returns:

"---CTCGTNNNGTCGTNN"  "-----CGTNNNGTCGTGN"  "XNNNNNCGTNNNGTCGTGN"

\G is supported by Perl, PCRE (used by PHP and by R when perl=TRUE is set), Ruby, Python's PyPI regex engine and others. It asserts that the current position is at the beginning of the string for the first match and at the end of the previous match thereafter.

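If the string were "NNNCTCGTNNNGTCGTNN", each of the first three "N"s would be matched (and replaced with a hyphen by gsub). The next attempt starts at "C", where N fails to match. Because \G ties every further attempt to the end of the previous match, none of the later "N"s can match, so the replacement stops there.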
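To make the behaviour of \G a little more concrete, here is a small sketch that runs the same pattern over a few extra inputs. The vector x and its second and third elements are made up for illustration and are not from the question.

```r
# Hypothetical inputs: one from the question, one not starting with N,
# and one consisting only of N.
x <- c("NNNCTCGTNNNGTCGTNN", "XNNN", "NNNN")

# \G needs the PCRE engine, hence perl = TRUE.
out <- gsub("\\GN", "-", x, perl = TRUE)
print(out)

# Each N is replaced by exactly one "-", so the lengths are unchanged.
stopifnot(all(nchar(out) == nchar(x)))
```

The second element should come back untouched, since the very first match attempt fails at "X" and \G rules out any later starting position.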

Cary Swoveland

One approach would be to use stringr, whose str_replace_all accepts a function as the replacement (a regex callback):

library(stringr)

test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
# Callback: turn every N in the matched text into a dash
repl <- function(x) { gsub("N", "-", x) }
str_replace_all(test, "^N+", repl)

[1] "---CTCGTNNNGTCGTNN" "-----CGTNNNGTCGTGN"

The strategy here is to first match the run of one or more leading N's with ^N+. Then, we pass that match to a callback function which replaces each N with a dash.
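If you would rather not load a package, the same match-then-rebuild idea can be sketched in base R. This is only one possible variant, not part of the stringr approach: the helper name mask_leading_n is hypothetical, and strrep() assumes R 3.3.0 or later.

```r
# Hypothetical helper: replace the leading run of N's with dashes of equal length.
mask_leading_n <- function(x) {
  rest   <- sub("^N+", "", x)         # string with the leading N's removed
  n_lead <- nchar(x) - nchar(rest)    # length of the removed run (0 if none)
  paste0(strrep("-", n_lead), rest)   # put back the same number of dashes
}

test <- c("NNNCTCGTNNNGTCGTNN", "NNNNNCGTNNNGTCGTGN")
result <- mask_leading_n(test)
print(result)
```

For the two strings from the question this should give the same output as the stringr version, and strings without a leading N pass through unchanged because n_lead is 0.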

Tim Biegeleisen