
I have long strings taken from a VCF file, such as the following (truncated for the example):

chr1    11189845    COSM462604;COSM893813   G   C,T 158.16  PASS    AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;
chr1    11190804    COSM180789  C   T   134.06  PASS    AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;

I want to write a single regex that returns all values of FAO on a given line. The valid format for FAO is: FAO=SomeNumber; or FAO=SomeNumber, SomeNumber, SomeNumber, etc.;

Is there a way to write a regex capture group that handles both a single value and an unbounded number of comma-separated values, up to the next ';'?

I've tried

FAO=((([0-9]+);)|(([0-9]+),([0-9])+))

But it only handles up to two numbers, and I need matcher group 1 to be the first value, matcher group 2 to be the second, and so on.

Mazdak
Brent
  • What language are you using? – Andy Lester May 06 '15 at 15:18
  • why do you have to do it in a single regex? I'd rather extract the string between `FAO=` and `;` and split it at the `,` tokens. – m.s. May 06 '15 at 15:19
  • This isn't a language-specific problem at this point; I just want a standard regex to parse this. I could write code to do it, but that doesn't help, because the application requires a valid regex as input and uses the matcher groups for processing. – Brent May 06 '15 at 18:30

2 Answers


You could use a regex like this:

FAO=([0-9]+(,[0-9]+)*);

The outer parentheses put the whole value list (one number, or several separated by commas) into the first capture group.
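As a rough illustration of what group 1 holds, here is a minimal JavaScript sketch run against INFO fragments from the question's sample lines. The helper name `faoList` is made up for this example and is not part of any library.

```javascript
// Hypothetical helper: returns the raw FAO value list (capture group 1),
// or null when the line has no FAO field in the expected format.
function faoList(line) {
  const m = /FAO=([0-9]+(,[0-9]+)*);/.exec(line);
  return m ? m[1] : null;
}

console.log(faoList('AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;'));    // "0,0"
console.log(faoList('AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;')); // "0"
```

Group 1 is a single string here, so the individual numbers are still joined by commas.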

EDIT

Since you want to capture the individual values in separate matching groups, this approach won't work: a capturing group inside a * quantifier only keeps its last iteration. See the accepted answer to this question for a solution.
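The "last iteration only" behaviour can be seen with a small JavaScript sketch. The pattern is a variant written for this illustration (first value in group 1, repeated group 2 for the rest), and `captureGroups` is a hypothetical helper name.

```javascript
// Hypothetical helper: returns the capture groups of the first match
// (without the full match at index 0), or null if nothing matches.
function captureGroups(text) {
  // Group 2 sits inside a repeated (...)* and is overwritten on each iteration.
  const m = /FAO=([0-9]+)(?:,([0-9]+))*;/.exec(text);
  return m ? m.slice(1) : null;
}

console.log(captureGroups('FAO=7,8,9;')); // [ '7', '9' ]  (the 8 is lost)
```

However many values follow the first one, the result always has exactly two groups, because the number of groups is fixed by the pattern rather than by the input.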

EDIT 2

See this demo, based on that answer, for an example of a PCRE regex that matches each number with the same capturing group, one number per match:

(?:FAO=|\G,)\K(\d+)

Note that not all regex flavours support \G and \K. \G asserts the position where the previous match ended (or the start of the string on the first attempt), and \K resets the start of the reported match, so the FAO= prefix and the commas are left out of it.
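JavaScript is one of the flavours without \G and \K. A comparable effect can be sketched there with a variable-length lookbehind (ES2018) and `String.prototype.matchAll`. This is a different technique from the PCRE pattern above, it needs a global search rather than numbered groups, and the names `FAO_VALUE` and `faoValues` are invented for this example.

```javascript
// A number counts as an FAO value when the text right before it is
// "FAO=" followed by zero or more earlier "number," items.
const FAO_VALUE = /(?<=FAO=(?:\d+,)*)\d+/g;

// Hypothetical helper: collects every FAO value on the line as strings.
function faoValues(line) {
  return Array.from(line.matchAll(FAO_VALUE), (m) => m[0]);
}

console.log(faoValues('AF=0,0;AO=0,0;DP=1201;FAO=3,14;FDP=1201;FR=.;')); // [ '3', '14' ]
console.log(faoValues('AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;')); // [ '0' ]
```

As with the PCRE version, each value arrives as its own match rather than as group 1, group 2, and so on.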

1010

You can use a negated character class: [^;]+. This matches one or more characters that are not a semicolon, so the match runs up to (but does not include) the first semicolon after FAO=.

var strings = [
  'chr1    11189845    COSM462604;COSM893813   G   C,T 158.16  PASS    AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
  'chr1    11190804    COSM180789  C   T   134.06  PASS    AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];

strings.forEach(function(str) {
  alert(str.match(/(FAO=[^;]+)/)[1]);
});

From there you can move the capture group so that it grabs only the values, /FAO=([^;]+)/, and then split that value on the comma delimiter.

var strings = [
  'chr1    11189845    COSM462604;COSM893813   G   C,T 158.16  PASS    AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
  'chr1    11190804    COSM180789  C   T   134.06  PASS    AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];

strings.forEach(function(str) {
  alert(str.match(/FAO=([^;]+)/)[1].split(','));
});

As stated in this SO answer, it's not possible in most languages to have an arbitrary number of group matches from a single pattern.
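If the number of values has a known upper bound, one common workaround is to spell out that many optional groups. The sketch below assumes at most three values per FAO field. That limit is an assumption made for illustration, not something the VCF data guarantees, and the names `FAO_UP_TO_3` and `faoGroups` are hypothetical.

```javascript
// Assumption: an FAO field never holds more than three values.
// Each value gets its own numbered group; unused groups stay undefined.
const FAO_UP_TO_3 = /FAO=(\d+)(?:,(\d+))?(?:,(\d+))?;/;

// Hypothetical helper: returns groups 1..3 of the match, or null if none.
function faoGroups(line) {
  const m = FAO_UP_TO_3.exec(line);
  return m ? m.slice(1) : null;
}

console.log(faoGroups('DP=1201;FAO=0,0;FDP=1201;')); // [ '0', '0', undefined ]
console.log(faoGroups('DP=1016;FAO=0;FDP=1018;'));   // [ '0', undefined, undefined ]
console.log(faoGroups('FAO=1,2,3,4;'));              // null (more values than groups)
```

The pattern grows with the bound and rejects lines that exceed it, so this only helps when a realistic maximum is known in advance.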

Jason Cust
  • The problem is that I need the matcher groups to match the numbers. I can't use additional code to do this for this application. – Brent May 06 '15 at 18:32
  • @Brent It's not possible in most languages to have an arbitrary number of group matches. See the possible duplicate question that points this out: http://stackoverflow.com/questions/3537878/how-to-capture-an-arbitrary-number-of-groups-in-javascript-regexp – Jason Cust May 06 '15 at 19:54
  • Thanks! I think that answers the problem... it's not possible to do what I want :-( – Brent May 08 '15 at 10:38