R extract part of string

Question

I have a question about extracting part of a string. For example, I have a string like this:

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

I need to extract everything between GN= and the next ; character. Here that would be NOC2L.

Is that possible?

Note: This is the INFO column from the VCF file format. GN is the gene name, so we want to extract the gene name from the INFO column.

Question is a little unclear, as it seems your desired string will not always be followed by a semicolon. — jbaums, Mar 15 '12 at 14:12

score 36 · Accepted Answer · edited Jul 31 '18 at 12:57

Try this:

sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"

edited Jul 31 '18 at 12:57 · zx8754

answered Mar 15 '12 at 13:53 · kohske

Thanks, kohske. And what if NOC2L is at the end of the line? Then the whole line is selected! – Lisann Mar 15 '12 at 13:58
How is your string exactly? Could you please provide an example? – kohske Mar 15 '12 at 14:03
like this: a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L" – Lisann Mar 15 '12 at 14:04
try this: `sub(".*?GN=(.*?)(;.*|$)", "\\1", a)` – kohske Mar 15 '12 at 14:06
Thanks for the question/answer. What if there is no such thing in "a"? In that case, I would like this to return NA. It doesn't as written. Any ideas? – Rotail Apr 10 '16 at 21:47
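
On the last comment (returning NA when there is no GN= field): one possible base R sketch uses regexpr() with a lookbehind and fills in NA wherever nothing matched. The helper name extract_gn and the three test strings are made up for illustration and are not from the thread.

```r
# Hypothetical helper (not from the thread): value of GN=, or NA if absent.
extract_gn <- function(x) {
  # (?<=GN=) requires "GN=" immediately before the match; [^;]* stops at ';'
  m <- regexpr("(?<=GN=)[^;]*", x, perl = TRUE)
  hit <- as.integer(m) != -1L          # regexpr() gives -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)         # regmatches() returns matched elements only
  out
}

x <- c("DP=26;GN=NOC2L;PA=1",  # GN in the middle
       "DP=26;GN=NOC2L",       # GN at the end of the string
       "DP=26;AN=2")           # no GN at all
extract_gn(x)
# [1] "NOC2L" "NOC2L" NA
```

Note that this pattern would also match a key that merely ends in GN (for example XGN=), so it assumes no such key occurs in the data.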

jbaums · Answer 2 · 2012-03-15T15:02:49.807

Assuming semicolons separate your elements, and equals signs occur exclusively between key/value pairs, a non-strictly-regex method would be:

bits <- unlist(strsplit(a, ';'))
do.call(rbind, strsplit(bits, '='))

      [,1] [,2]               
 [1,] "DP" "26"               
 [2,] "AN" "2"                
 [3,] "DB" "1"                
 [4,] "AC" "1"                
 [5,] "MQ" "56"               
 [6,] "MZ" "0"                
 [7,] "ST" "5:10,7:2"         
 [8,] "CQ" "SYNONYMOUS_CODING"
 [9,] "GN" "NOC2L"            
[10,] "PA" "1^1:0.720&2^1:0"

Then it's just a matter of selecting the appropriate element.
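
For instance, the selection step could look like the sketch below. It assumes, as stated above, that every element has exactly one = sign (a flag-style field without a value would break the rbind step). The variable names kv and info are introduced here for illustration.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Split into "key=value" pieces, then into a two-column key/value matrix
bits <- unlist(strsplit(a, ";", fixed = TRUE))
kv <- do.call(rbind, strsplit(bits, "=", fixed = TRUE))

# Named character vector: names are the keys, elements are the values
info <- setNames(kv[, 2], kv[, 1])

# Single-bracket indexing returns NA (rather than an error) for a missing key
unname(info["GN"])
# [1] "NOC2L"
```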

score 3 · Answer 3 · answered Mar 15 '12 at 13:59

One way would be:

# anchor on GN= so the capture is tied to the gene-name key, not to any key followed by ;
sub(".*GN=(\\w+);.*", "\\1", a, perl = TRUE)

I am sure there are more elegant ways to do it.

johannes


score 3 · Answer 4 · answered Mar 15 '12 at 14:00

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
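# locate "GN=<value>;"; [^;]* stops at the first ; even if more fields follow
m <- regexpr("GN=[^;]*;", a)
# skip the 3 characters of "GN=" and drop the trailing ;
substr(a, m + 3, m + attr(m, "match.length") - 2)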

Davy Kavanagh


score 2 · Answer 5 · answered Dec 13 '16 at 08:51

As the string comes from a VCF file, we can use the VariantAnnotation package:

library(VariantAnnotation)

# read dummy VCF file
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")

# show the first 5 INFO variables for the first 3 variants
info(vcf)[1:3, 1:5]
# DataFrame with 3 rows and 5 columns
#                  LDAF   AVGPOST       RSQ     ERATE     THETA
#             <numeric> <numeric> <numeric> <numeric> <numeric>
# rs7410291      0.3431    0.9890    0.9856     2e-03    0.0005
# rs147922003    0.0091    0.9963    0.8398     5e-04    0.0011
# rs114143073    0.0098    0.9891    0.5919     7e-04    0.0008

# Now extract one column, e.g.: LDAF
info(vcf)[1:3, "LDAF"]
# [1] 0.3431 0.0091 0.0098

The example VCF object above has no "GN" column, but the idea is the same, so in your case the following should work:

# extract gene name
info(vcf)[, "GN"]

score 1 · Answer 6 · answered Apr 25 '18 at 12:38

As an alternative to combining back references with sub, you could use a lookbehind and lookahead assertion with an extract operation, like so:

library(stringr)
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
str_extract(a, "(?<=GN=)[^;]*(?=;|$)")
# [1] "NOC2L"

Where:

(?<=GN=) is a lookbehind: it asserts that GN= must come immediately before the match
(?=;|$) is a lookahead: it asserts that ; or the end of the string ($) must come immediately after the match
[^;]* matches any number of characters that are not ;

Note: [^;]* was used over .* since the latter could match a ; and continue matching until the end of string ($).
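
The difference can be checked with a small base R sketch, using regmatches() and regexpr() with perl = TRUE in place of str_extract (a substitution made here so the example needs no extra package). The variable names greedy and tight are illustrative.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Greedy .* runs to the end of the string, where the $ branch of the lookahead succeeds
greedy <- regmatches(a, regexpr("(?<=GN=).*(?=;|$)", a, perl = TRUE))
greedy
# [1] "NOC2L;PA=1^1:0.720&2^1:0"

# [^;]* cannot cross a ;, so the match ends at the first delimiter
tight <- regmatches(a, regexpr("(?<=GN=)[^;]*(?=;|$)", a, perl = TRUE))
tight
# [1] "NOC2L"
```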
