2

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .

ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries

Generalized Pattern Tried:

       ".[\s]?(\w+?)"+                 // bruises.
      "(?:(\s)?,(\s)?(\w+?))*"+             // wounds marks dislocations
      "[\s]?(?:or|and) other (\w+).";     // Injuries

The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.

On trying the generalized pattern above, the output is: bruises dislocations Injuries

There is something wrong with the capturing group in "(?:(\s)?,(\s)?(\w+?))*". The group matches more than once, but it returns only "dislocations"; "wounds" and "marks" are lost.

Could you please suggest what the right pattern should be, and where the mistake is? A related question comes closest to mine, but its solution didn't help.

Thanks.

niks
  • What makes the words `bruises`, `wounds`, `marks`, `dislocations` and `Injuries` different than the other words? The first four words have a comma before or after it, but I don't see how `Injuries` fits into the picture. – Bart Kiers Feb 18 '10 at 09:07
  • I am trying to implement patterns that extract relationships. The pattern is `NP {, NP}* {,} other NP`. For example, "Bruises, wounds, dislocations or other injuries ..." should give `hyponym("bruise","injuries")`, `hyponym("wound","injuries")`, `hyponym("dislocations","injuries")`. That is how "Injuries" fits into the picture. – niks Feb 18 '10 at 09:31
  • When the capture group is annotated with a quantifier [i.e. (foo)*] then you will only get the last match. If you wanted to get all of them then you need to put the quantifier inside the capture and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP. – PSpeed Feb 18 '10 at 09:48
  • Thanks @PSpeed: you are right, that is the reason. Although regex is inappropriate here, I have no option left except Java regex. (Is there anything you could suggest?) You said the quantifier needs to go inside the capture. How should the following regex be modified? (?:(\s)?,(\s)?(\w+?))* – niks Feb 18 '10 at 10:12
  • Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match. – PSpeed Feb 18 '10 at 10:27
  • I take this as a solution then. Splitting the whole match seems to be the only way. Thanks a lot. (I am new to Stack Overflow. How can I choose your answer as the best, since it's in the comments?) – niks Feb 18 '10 at 10:45
  • I converted it to a real answer... so you can "select it" if you like. – PSpeed Feb 18 '10 at 19:03

3 Answers

0

Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.

You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.

EDIT

PSpeed posted a promising link to a third-party library, GATE, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.

Bart Kiers
  • I agree with you completely. Perl and Python may be the best when it comes to text processing, but the work is in Java. This work on patterns is a small submodule, so I need to find a solution to this regex problem in Java! – niks Feb 18 '10 at 10:03
  • Well, what can I say? There is really no viable way to extract these words from an input string like `A soldier may have bruiser or other injuries that hurt him` using regex. Really. – Bart Kiers Feb 18 '10 at 10:08
  • Note that you don't need Perl or Python for this. Java can do this just as well. Regex simply isn't the right tool for this job. – Bart Kiers Feb 18 '10 at 10:09
  • Thanks for this suggestion. Could you please suggest a non-regex Java solution? – niks Feb 18 '10 at 10:13
  • This is a really nice tool, like UIMA (perhaps even more so). Thanks!! – niks Feb 18 '10 at 10:43
0

The pattern that works is: \w+(?:\s*,\s*\w+)* and the comma-separated match then has to be split manually. Java regex offers no way to get every repetition of a quantified capture group directly.

In general, Java regex is not well suited to NLP. A useful tool for text mining is GATE: gate.ac.uk
Thanks to Bart K. and PSpeed.
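
For anyone landing here later, this is roughly how the "match the whole list, then split" idea could look for the two example sentences in the question. It is only a sketch: the class name `HyponymExtractor`, the method `extract` and the combined pattern with the `or|and other` tail are my own assumptions for illustration, and it treats every `\w+` token as a noun phrase, which real text will not respect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyponymExtractor {
    // Group 1: the whole comma-separated list (not quantified as a group, so nothing is lost).
    // Group 2: the single word that follows "or other" / "and other".
    // This combined pattern is an illustrative assumption, not the pattern from the question.
    private static final Pattern LIST_THEN_OTHER = Pattern.compile(
            "(\\w+(?:\\s*,\\s*\\w+)*)\\s*,?\\s+(?:or|and)\\s+other\\s+(\\w+)");

    // Returns one hyponym("item","head") string per list item found in the sentence.
    static List<String> extract(String sentence) {
        List<String> pairs = new ArrayList<>();
        Matcher m = LIST_THEN_OTHER.matcher(sentence);
        while (m.find()) {
            String head = m.group(2);
            // Split the captured list by hand, as suggested in the comments.
            for (String item : m.group(1).split("\\s*,\\s*")) {
                pairs.add("hyponym(\"" + item + "\",\"" + head + "\")");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "A soldier may have bruises , wounds , marks , dislocations"
                + " or other Injuries that hurt him .";
        for (String pair : extract(input)) {
            System.out.println(pair);
        }
    }
}
```

The same method also handles the single-item sentence from the question, because the list part allows zero commas.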

niks
0

When the capture group is annotated with a quantifier [i.e. (foo)*] then you will only get the last match. If you wanted to get all of them then you need to put the quantifier inside the capture and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
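
To see the "only the last match" behaviour in isolation, a small check like the following may help. The class and method names are made up for the demo, and the pattern is a trimmed-down version of the quantified group from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Returns what group 1 holds after the quantified group has repeated,
    // or null if the input does not match at all.
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("(?:\\s*,\\s*(\\w+))*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group matches three times, but only the final repetition is kept.
        System.out.println(lastCapture(", wounds , marks , dislocations")); // prints "dislocations"
    }
}
```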

How to fix: (?:(\s)?,(\s)?(\w+?))*

Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
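
A rough sketch of stepping through a text with Matcher.find() and splitting each match follows. One assumption to flag: I changed the trailing `*` to `+` so that a match needs at least one comma. With `*`, find() would also return every single word in the sentence. The class name `CommaListSplitter` and method `lists` are just placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommaListSplitter {
    // Same shape as \w+(?:\s*,\s*\w+)* but with '+' (an assumption for this demo),
    // so only runs that contain at least one comma are reported.
    private static final Pattern COMMA_LIST = Pattern.compile("\\w+(?:\\s*,\\s*\\w+)+");

    // Finds every comma-separated run in the text and splits each one into its words.
    static List<List<String>> lists(String text) {
        List<List<String>> result = new ArrayList<>();
        Matcher m = COMMA_LIST.matcher(text);
        while (m.find()) {
            // No capture groups needed: take the whole match and split it.
            result.add(Arrays.asList(m.group().split("\\s*,\\s*")));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lists("A soldier may have bruises , wounds , marks , dislocations"
                + " or other Injuries that hurt him ."));
    }
}
```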

And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/

PSpeed