
MetaMap files contain lines like the following:

mappings([map(-1000,[ev(-1000,'C0018017','Objective','Goals',[objective],[inpr],[[[1,1],[1,1],0]],yes,no)])]).

The format is documented as follows:

mappings(
      [map(negated overall score for this mapping, 
            [ev(negated candidate score,'UMLS concept ID','UMLS concept','preferred name for concept - may or may not be different',
                 [matched word or words lowercased that this candidate matches in the phrase - comma separated list],
                 [semantic type(s) - comma separated list],
                 [match map list - see below],candidate involved with head of phrase - yes or no,
                 is this an overmatch - yes or no
               )
            ]
          )
      ]
    ).

I want to run a regex query in Java that gives me the strings for the UMLS concept ID, the semantic type(s), and the match map list. Is regex the right tool, or what is the most efficient way to accomplish this in Java?

Christian

3 Answers


Here's my attempt at a regex solution. This "meta-regexing" by replacement is a methodology I'm experimenting with; I hope it leads to more readable code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "mappings([map(-1000,[ev(-1000,'C0018017','Objective','Goals',[objective],[inpr],[[[1,1],[1,1],0]],yes,no)])]).";
String regex = 
    "mappings([map(number,[ev(number,<quoted>,quoted,quoted,[csv],[<csv>],[<matchmap>],yesno,yesno)])])."
    .replaceAll("([\\.\\(\\)\\[\\]])", "\\\\$1") // escape metacharacters
    .replace("<", "(").replace(">", ")") // set up capture groups
    .replace("number", "-?\\d+")
    .replace("quoted", "'[^']*'")
    .replace("yesno", "(?:yes|no)")
    .replace("csv", "[^\\]]*")
    .replace("matchmap", ".*?")
;
System.out.println(regex);
// prints "mappings\(\[map\(-?\d+,\[ev\(-?\d+,('[^']*'),'[^']*','[^']*',\[[^\]]*\],\[([^\]]*)\],\[(.*?)\],(?:yes|no),(?:yes|no)\)\]\)\]\)\."

Matcher m = Pattern.compile(regex).matcher(line);
if (m.find()) {
    System.out.println(m.group(1)); // prints "'C0018017'"
    System.out.println(m.group(2)); // prints "inpr"
    System.out.println(m.group(3)); // prints "[[1,1],[1,1],0]"
}

This replace-based meta-regexing lets you accommodate whitespace between symbols easily, just by adding the appropriate replace step (instead of sprinkling it throughout one unreadable mess).
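To illustrate the whitespace point, here is a minimal sketch that adds one extra replace step to the template above. It assumes optional whitespace may appear only around commas; the class name `MetaMapWhitespace` and the helper `extract` are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaMapWhitespace {

    // Builds the pattern from the readable template, with one added replace
    // that allows optional whitespace around every comma (an assumption about
    // where whitespace can occur; adjust if your files differ).
    static Pattern buildPattern() {
        String regex =
            "mappings([map(number,[ev(number,<quoted>,quoted,quoted,[csv],[<csv>],[<matchmap>],yesno,yesno)])])."
            .replaceAll("([\\.\\(\\)\\[\\]])", "\\\\$1") // escape metacharacters
            .replace(",", "\\s*,\\s*")                   // tolerate whitespace around commas
            .replace("<", "(").replace(">", ")")         // set up capture groups
            .replace("number", "-?\\d+")
            .replace("quoted", "'[^']*'")
            .replace("yesno", "(?:yes|no)")
            .replace("csv", "[^\\]]*")
            .replace("matchmap", ".*?");
        return Pattern.compile(regex);
    }

    // Returns {concept ID (with quotes), semantic types, match map list},
    // or null if the line does not match.
    static String[] extract(String line) {
        Matcher m = buildPattern().matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String spaced = "mappings([map(-1000, [ev(-1000, 'C0018017', 'Objective' , 'Goals',"
                + "[objective], [inpr],[[[1,1],[1,1],0]], yes,no)])]).";
        String[] fields = extract(spaced);
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

The comma replace has to run after the escaping step and before the placeholder replaces, so that it only touches the commas of the template itself. Whitespace inside the match map list is already covered, since that placeholder expands to `.*?`.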

polygenelubricants
  • Nice one. Btw: what dream job in Oct? – BalusC Apr 28 '10 at 12:09
  • 1
    I like your meta-regex approach! Until now, I only used named String constants (`String number = "-?\\d+"`) and concatenated these (`...+"[ev("+number+","+...`), but that still resulted in ugly code. – Christian Semrau Apr 28 '10 at 16:27

That's a truly hairy format. Regex sounds like the way to go, but you're going to have a truly hairy regex:

mappings\(\[map\(-?[0-9.]+,\[ev\(-?[0-9.]+,'(.*?)','.*?','.*?',\[.*?\],\[(.*?)\],\[(.*)\],(?:yes|no),(?:yes|no)\)\]\)\]\)\.

It gets worse when you have to express the regex as a Java String -- as always, you'll replace every \ with \\. But this should get you what you want; matching groups 1, 2, and 3 are the Strings that you wanted to pull out. Note that I haven't rigorously tested it against malformed input because I haven't the stomach for it. :)

For educational purposes: Despite its appearance, this wasn't actually hard to construct at all -- I just took your sample line and replaced the actual values with the appropriate wildcards, making sure to escape out the parens and brackets and the dot at the end.
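As a sketch of the escaping step, this is the regex above written as a Java string literal, with every backslash doubled, and applied to the sample line from the question. The class name `MetaMapJavaString` and the helper `extract` are hypothetical, and like the regex itself this has only been checked against that one sample line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaMapJavaString {

    // The regex from above as a Java string literal: each \ becomes \\.
    static final Pattern MAPPING = Pattern.compile(
        "mappings\\(\\[map\\(-?[0-9.]+,\\[ev\\(-?[0-9.]+,'(.*?)','.*?','.*?',"
        + "\\[.*?\\],\\[(.*?)\\],\\[(.*)\\],(?:yes|no),(?:yes|no)\\)\\]\\)\\]\\)\\.");

    // Returns {concept ID, semantic types, match map list},
    // or null if the line does not match.
    static String[] extract(String line) {
        Matcher m = MAPPING.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "mappings([map(-1000,[ev(-1000,'C0018017','Objective','Goals',"
                + "[objective],[inpr],[[[1,1],[1,1],0]],yes,no)])]).";
        String[] fields = extract(line);
        System.out.println(fields[0]); // concept ID, without the surrounding quotes
        System.out.println(fields[1]); // semantic type(s)
        System.out.println(fields[2]); // match map list
    }
}
```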

Etaoin

It's possible, yes.

Something like the following would work, under a lot of assumptions: that the values you've quoted are the only places quotes are legal, that the values you've put in [] are the only places brackets are legal, that '[' and ']' can't appear inside values, and that the match map list can't contain ]] except at its end.

^[^']+?'([^']*+)'[^\[]+\[[^]]+\],\[([^\]]*?)\],\[\[(.*?)\]\].*$

This should give you those three fields as the three matched groups (tested on your example with http://www.regexplanet.com/simple/index.html).

As a Java string, that is:

"^[^']+?'([^']*+)'[^\\[]+\\[[^]]+\\],\\[([^\\]]*?)\\],\\[\\[(.*?)\\]\\].*$"

But that isn't very maintainable. It would probably be better to be a bit more verbose with this one!
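One possible shape for a more verbose version is sketched below: the pattern is assembled from named pieces with a comment per field, and named capture groups (Java 7+) replace group numbers. This is an illustration under assumptions, not a tested parser. It matches only the `ev(...)` term rather than the whole line, and the names `MetaMapVerbose`, `EV`, `cui`, `semtypes` and `matchmap` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaMapVerbose {

    private static final String NUMBER = "-?\\d+";
    private static final String QUOTED = "'[^']*'";
    private static final String YES_NO = "(?:yes|no)";

    // One piece per field of the ev(...) term, in the documented order.
    static final Pattern EV = Pattern.compile(
          "ev\\(" + NUMBER + ","            // negated candidate score
        + "'(?<cui>[^']*)',"                // UMLS concept ID
        + QUOTED + "," + QUOTED + ","       // UMLS concept, preferred name
        + "\\[[^\\]]*\\],"                  // matched words
        + "\\[(?<semtypes>[^\\]]*)\\],"     // semantic type(s)
        + "\\[(?<matchmap>.*?)\\],"         // match map list
        + YES_NO + "," + YES_NO + "\\)");   // head involvement, overmatch

    // Returns {concept ID, semantic types, match map list} for the first
    // ev(...) term in the line, or null if there is none.
    static String[] extract(String line) {
        Matcher m = EV.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("cui"), m.group("semtypes"), m.group("matchmap") };
    }

    public static void main(String[] args) {
        String line = "mappings([map(-1000,[ev(-1000,'C0018017','Objective','Goals',"
                + "[objective],[inpr],[[[1,1],[1,1],0]],yes,no)])]).";
        for (String field : extract(line)) {
            System.out.println(field);
        }
    }
}
```

The same assumptions as above still apply: no quotes inside quoted values and no brackets inside the word and semantic type lists.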

Jordan Stewart