20

I have a question about extracting a part of a string. For example I have a string like this:

a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

I need to extract everything between GN= and the next semicolon (;). So here it would be NOC2L.

Is that possible?

Note: This is the INFO column from the VCF file format. GN is the gene name, so we want to extract the gene name from the INFO column.

Timur Shtatland
Lisann
  • Question is a little unclear, as it seems your desired string will not always be followed by a semicolon. – jbaums Mar 15 '12 at 14:12

6 Answers

36

Try this:

sub(".*?GN=(.*?);.*", "\\1", a)
# [1] "NOC2L"
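
One caveat with the sub() approach: when the pattern does not match, sub() returns the input string unchanged, so a record without a GN field (or with GN as the last field and no trailing semicolon) would not give a gene name. A minimal sketch of a more defensive variant, assuming a hypothetical vector more_info of INFO strings (the second and third entries are made up for illustration):

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Hypothetical extra records: one without GN, one with GN as the last field
more_info <- c(a, "DP=10;AN=2", "DP=5;GN=BRCA1")

# GN= must sit at the start of the string or right after a semicolon;
# the value runs up to the next semicolon or the end of the string
pattern <- "^(?:.*;)?GN=([^;]*).*$"

# Only substitute where the pattern matches; otherwise return NA
has_gn <- grepl(pattern, more_info, perl = TRUE)
gene <- ifelse(has_gn, sub(pattern, "\\1", more_info, perl = TRUE), NA_character_)
gene
```

Both sub() and grepl() are vectorised, so the same two calls work on a whole INFO column at once.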
zx8754
kohske
15

Assuming semicolons separate your elements, and equals signs occur exclusively between key/value pairs, a non-strictly-regex method would be:

bits <- unlist(strsplit(a, ';'))
do.call(rbind, strsplit(bits, '='))

      [,1] [,2]               
 [1,] "DP" "26"               
 [2,] "AN" "2"                
 [3,] "DB" "1"                
 [4,] "AC" "1"                
 [5,] "MQ" "56"               
 [6,] "MZ" "0"                
 [7,] "ST" "5:10,7:2"         
 [8,] "CQ" "SYNONYMOUS_CODING"
 [9,] "GN" "NOC2L"            
[10,] "PA" "1^1:0.720&2^1:0"  

Then it's just a matter of selecting the appropriate element.
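
The selection step could look like the sketch below. The names kv and gene are only illustrative. It also assumes every element contains exactly one equals sign; VCF flag entries without a value would need separate handling before rbind.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Split into "key=value" pieces, then into a two-column key/value matrix
bits <- unlist(strsplit(a, ";"))
kv <- do.call(rbind, strsplit(bits, "="))

# Pick the value (column 2) from the row whose key (column 1) is "GN"
gene <- kv[kv[, 1] == "GN", 2]
gene
```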

jbaums
3

One way would be:

# anchor on GN= so the capture does not depend on where GN sits in the string
gsub(".*GN=(\\w+);.*", "\\1", a, perl = TRUE)

I am sure there are more elegant ways to do it.

johannes
3
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
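# match "GN=" up to the next semicolon; [^;]* keeps the match from running past it
m <- regexpr("GN=[^;]*;", a)
# skip the 3 characters of "GN=" and drop the trailing ";"
substr(a, m + 3, m + attr(m, "match.length") - 2)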
Davy Kavanagh
2

As the string comes from a VCF file, we can use the VariantAnnotation package:

library(VariantAnnotation)

# read dummy VCF file
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")

# show the first 3 rows and first 5 variables of the INFO column
info(vcf)[1:3, 1:5]
# DataFrame with 3 rows and 5 columns
#                  LDAF   AVGPOST       RSQ     ERATE     THETA
#             <numeric> <numeric> <numeric> <numeric> <numeric>
# rs7410291      0.3431    0.9890    0.9856     2e-03    0.0005
# rs147922003    0.0091    0.9963    0.8398     5e-04    0.0011
# rs114143073    0.0098    0.9891    0.5919     7e-04    0.0008

# Now extract one column, e.g.: LDAF
info(vcf)[1:3, "LDAF"]
# [1] 0.3431 0.0091 0.0098

The example VCF object above has no "GN" column, but the idea is the same, so in your case the following should work:

# extract gene name
info(vcf)[, "GN"]
zx8754
1

As an alternative to combining back references with sub, you could use a lookbehind and lookahead assertion with an extract operation, like so:

library(stringr)
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"
str_extract(a, "(?<=GN=)[^;]*(?=;|$)")
# [1] "NOC2L"

Where:

  • (?<=GN=) is a lookbehind: it asserts that GN= comes immediately before the match, without being part of it
  • [^;]* matches any number of characters that are not ;
  • (?=;|$) is a lookahead: it asserts that ; or the end of the string ($) comes immediately after the match

Note: [^;]* was used over .* since the latter could match a ; and continue matching until the end of string ($).
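
For readers who prefer not to load stringr, the same lookbehind idea can be sketched in base R with regexpr() and regmatches(). Lookarounds need perl = TRUE there. The trailing lookahead is left out because [^;]* already stops at a semicolon or the end of the string.

```r
a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0"

# Locate the run of non-semicolon characters that directly follows "GN="
m <- regexpr("(?<=GN=)[^;]*", a, perl = TRUE)

# Extract the matched text; this is character(0) if there is no match
gene <- regmatches(a, m)
gene
```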

MilesMcBain