HGVS nomenclature follows this pattern:

xxxxx.yyyy:charactersnumbercharacters

I would like to write a regex in Java that extracts all the tokens from the pattern above. For example, it should produce 5 tokens: { 'xxxxx', 'yyyy', 'characters', 'number', 'characters' }

I have used a simple split approach to fetch the tokens, but I don't think it is an optimal solution.

My current code is:

String hgsv = "BRAF.p:V600E";
// split() takes a regex, so the dot must be escaped to match a literal "."
String[] tokens = hgsv.split("\\.");
this.symbol = tokens[0];               // "BRAF"
String type = tokens[1].split(":")[0]; // "p"

I would like to use Pattern and Matcher in Java, but I have no idea how to write the regex for the tokens above.

Any clue how to do that? I will need a regex anyway to separate the characters, number, and characters parts, so why not use a regex for the entire string?

I found a link, but it is for Python; I need something similar in Java.

virsha

1 Answer

I think what you're probably looking for is to use capture groups, like this:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String  s = "BRAF.p:V600E";
Pattern p = Pattern.compile("(\\w+)\\.(\\w+):([a-zA-Z]+)(\\d+)([a-zA-Z]+)");
Matcher m = p.matcher(s);
if (m.matches()) {
    String[] parts = {m.group(1),
                      m.group(2),
                      m.group(3),
                      m.group(4),
                      m.group(5)};
    // Prints "[BRAF, p, V, 600, E]"
    System.out.println(Arrays.toString(parts));
} else {
    // The input String is invalid.
}

That's really a lot like a split, but it's more robust: matches() succeeds only if the whole String fits the pattern, so the input is validated before you read any of the groups.
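To make the validation point concrete, here is a minimal sketch that wraps the same pattern in a helper. The class name VariantTokenizer and the method tokenize are made up for illustration, and returning null for invalid input is just one possible convention. It also shows a pitfall of the split approach: String.split takes a regex, so an unescaped "." matches every character.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class VariantTokenizer {
    // Same pattern as above, compiled once and reused.
    private static final Pattern VARIANT =
            Pattern.compile("(\\w+)\\.(\\w+):([a-zA-Z]+)(\\d+)([a-zA-Z]+)");

    // Returns the five tokens, or null if the input does not match the pattern.
    public static String[] tokenize(String input) {
        Matcher m = VARIANT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] tokens = new String[m.groupCount()];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = m.group(i + 1); // group 0 is the whole match
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints "[BRAF, p, V, 600, E]"
        System.out.println(Arrays.toString(tokenize("BRAF.p:V600E")));
        // Prints "null": there is no "." or ":" separator, so the input is rejected
        System.out.println(Arrays.toString(tokenize("BRAF-p-V600E")));
        // Prints "0": an unescaped "." treats every character as a delimiter
        System.out.println("BRAF.p:V600E".split(".").length);
        // Prints "2": the escaped dot splits on the literal "."
        System.out.println("BRAF.p:V600E".split("\\.").length);
    }
}
```

With plain split, a malformed input tends to surface later as an ArrayIndexOutOfBoundsException; with matches() you get one explicit place to handle it.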

Note that I have no idea whether that is exactly the right pattern for your case. I don't know the details of the HGVS notation you're talking about, and your description is pretty vague. (What are xxxxx and yyyy, for example? What counts as "characters"?) If you link me to a specification or a detailed description of this notation, I can try to write a regex that's more definitely correct.

Anyhow, my example shows the basic idea. You might also see http://www.regular-expressions.info/brackets.html for more information.
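If the numbered groups get hard to keep track of, Java 7 and later also support named capture groups. The following is only a sketch: the group names (symbol, type, ref, position, alt) and the class name NamedGroupsDemo are my own guesses at what each part might mean, not names taken from any specification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the group names are assumptions for illustration.
public class NamedGroupsDemo {
    private static final Pattern VARIANT = Pattern.compile(
            "(?<symbol>\\w+)\\.(?<type>\\w+):"
            + "(?<ref>[a-zA-Z]+)(?<position>\\d+)(?<alt>[a-zA-Z]+)");

    // Builds a readable summary of the tokens, or reports invalid input.
    public static String describe(String input) {
        Matcher m = VARIANT.matcher(input);
        if (!m.matches()) {
            return "invalid: " + input;
        }
        return "symbol=" + m.group("symbol")
                + ", type=" + m.group("type")
                + ", ref=" + m.group("ref")
                + ", position=" + m.group("position")
                + ", alt=" + m.group("alt");
    }

    public static void main(String[] args) {
        // Prints "symbol=BRAF, type=p, ref=V, position=600, alt=E"
        System.out.println(describe("BRAF.p:V600E"));
        // Prints "invalid: V600E"
        System.out.println(describe("V600E"));
    }
}
```

Named groups keep the calling code readable if the pattern later gains or loses a group, since the names do not shift the way indices do.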

Radiodef
  • Regarding your note that you don't know the exact details of the HGVS notation and that my description is vague: please see this link: http://varnomen.hgvs.org/bg-material/simple/ – virsha Jun 06 '17 at 09:01