Could someone please explain these gsub arguments precisely?

Question

I have this code for truncating strings after an underscore "_" is found, but I don't understand the operators/arguments that were passed to gsub to make this manipulation possible. In particular, why do I have to use "\\1" as the replacement instead of ""? I do note that replacing with "" removes the entire string. I am also a bit confused by how the operators are being used, particularly parentheses and brackets:

AAA <- "ATGAS_1121"
(aa <- gsub("([^_]*).*", "\\1", AAA))
## [1] "ATGAS"

Please note, this post draws heavily from: R remove part of string

Thanks, I appreciate it.

[`"([^_]*).*"`](http://rick.measham.id.au/paste/explain.pl?regex=%28%5B%5E_%5D*%29.*) — rawr, Feb 24 '15 at 16:29

score 7 · Accepted Answer · edited Feb 24 '15 at 16:44

In regex, (..) is called a capturing group; it captures all the characters matched by the pattern inside that group. You can refer to those characters by back-referencing the group's index number.

gsub("([^_]*).*", "\\1", AAA)

([^_]*) captures zero or more characters at the start of the string that are not _. The following .* matches all the remaining characters. gsub replaces all the matched characters with the replacement string. If your code were

gsub("([^_]*).*", "", AAA)

it would remove all the characters, since the pattern matched the whole string but captured only the characters at the start that are not _. So replacing the matched characters with the contents of group 1 gives you the part before the _ symbol.
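To make the difference between "what was matched" and "what was captured" concrete, here is a small sketch using the same string. The variable names (`whole_match`, `group_one`, `emptied`) are only illustrative, and `sub` is used because a single replacement is enough here.

```r
AAA <- "ATGAS_1121"

# The text the full pattern consumes (the overall match)
whole_match <- regmatches(AAA, regexpr("([^_]*).*", AAA))

# Replace the overall match with the contents of group 1
group_one <- sub("([^_]*).*", "\\1", AAA)

# Replace the overall match with nothing
emptied <- sub("([^_]*).*", "", AAA)

whole_match  # "ATGAS_1121"
group_one    # "ATGAS"
emptied      # ""
```

The whole string is always what gets replaced; the replacement argument only decides what is put back in its place.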

You could achieve the same result using \K

> gsub("[^_]*\\K.*", "", AAA, perl = TRUE)
[1] "ATGAS"

Since \K is a PCRE feature, you need to set perl = TRUE. \K keeps the text matched so far out of the overall regex match, so only the text after \K is replaced.
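As a rough illustration of what \K does to the reported match, the sketch below extracts the part of the string that the pattern hands to the replacement step. The names `dropped` and `kept` are just for this example.

```r
AAA <- "ATGAS_1121"

# With \K, the reported match starts after the leading non-underscore run,
# so only this part is eligible for replacement.
dropped <- regmatches(AAA, regexpr("[^_]*\\K.*", AAA, perl = TRUE))

# Replacing that part with "" leaves the text before the underscore.
kept <- sub("[^_]*\\K.*", "", AAA, perl = TRUE)

dropped  # "_1121"
kept     # "ATGAS"
```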

Or you could just do `gsub("_.*", "", AAA)` — David Arenburg, Feb 24 '15 at 16:45
I think `sub` would be enough for all. — Avinash Raj, Feb 24 '15 at 16:46

score 2 · Answer 2 · edited Jun 20 '20 at 09:12

Why I should have to gsub \\1 instead of ""

A back-reference refers to the characters that were captured by a capturing group; in a replacement string, it inserts that captured text. A capturing group is created by placing the part of the pattern to be grouped inside a set of parentheses, ( ... ). Each set of capturing parentheses is assigned a number from left to right, whether or not the engine uses these parentheses when it evaluates the match.

In this case you need the back-reference \1 in the replacement argument so that the characters matched by Group 1 end up in the new string aa. With "" instead, aa becomes an empty string, because the pattern matches the entire string and the whole match is replaced with nothing.
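A short sketch of how group numbering works with more than one pair of parentheses. The two-group pattern below is a hypothetical variation on the one in the question, not the original code.

```r
AAA <- "ATGAS_1121"

# Group 1: everything before the underscore; group 2: everything after it
pattern <- "([^_]*)_(.*)"

before  <- sub(pattern, "\\1", AAA)
after   <- sub(pattern, "\\2", AAA)
swapped <- sub(pattern, "\\2_\\1", AAA)

before   # "ATGAS"
after    # "1121"
swapped  # "1121_ATGAS"
```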

I am also a bit confused by how the operators are being used ... brackets

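The square brackets [ ... ] you're asking about are called a character class, which defines a set of characters; it tells the engine to match any one of the characters in the set. In your pattern, [^_] is a negated class: the leading ^ means "match any one character that is not an underscore".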
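To see a class and its negated form side by side, here is a minimal sketch on the same input. The digit class `[0-9]` is an extra example that does not appear in the question.

```r
x <- "ATGAS_1121"

no_digits   <- gsub("[0-9]", "", x)   # remove every digit
only_digits <- gsub("[^0-9]", "", x)  # remove everything that is not a digit
only_unders <- gsub("[^_]", "", x)    # remove everything that is not an underscore

no_digits    # "ATGAS_"
only_digits  # "1121"
only_unders  # "_"
```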

How I would recommend doing this:

In this example, a regular expression is not needed at all; you can simply split the string.

AAA <- 'ATGAS_1121'
strsplit(AAA, '_', fixed = TRUE)[[1]][1]
# [1] "ATGAS"

And if you insist on using a regular expression, you can use sub as follows instead:

AAA <- 'ATGAS_1121'
sub('_.*', '', AAA)
# [1] "ATGAS"
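Both approaches are vectorised, so as a rough check they can be compared on a few more inputs. The vector `ids` below is made-up test data, chosen to include a string with no underscore and one with several.

```r
# Hypothetical inputs: one underscore, none, and several
ids <- c("ATGAS_1121", "NOUNDERSCORE", "A_B_C")

# Regex approach: drop the first underscore and everything after it
by_sub <- sub("_.*", "", ids)

# Split approach: take the first piece of each split string
by_split <- vapply(strsplit(ids, "_", fixed = TRUE), `[`, character(1), 1)

by_sub    # "ATGAS" "NOUNDERSCORE" "A"
by_split  # "ATGAS" "NOUNDERSCORE" "A"
```

Strings without an underscore pass through unchanged in both cases, since there is nothing to remove or split on.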
