
I have a bunch of XML text that I need to iterate over to extract some data. I know regex is not the best tool for this, but the data I need is minimal and I have already extracted it successfully with regex. The problem is that I need the data to appear in order. The sample below is what I am extracting from, but I need to process it paragraph by paragraph, iterating over the pnum=1, pnum=2, ... values that mark the beginning of each paragraph. How do I iterate over these using regex? Would regex lookarounds help here?

First Paragraph:

<p pnum=1>
<s snum=1>
<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>
<wf cmd=done pos=NN lemma=approval wnsn=1 lexsn=1:04:02::>approval</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Gov._Price_Daniel</wf>
<wf cmd=done pos=NN lemma=banker wnsn=1 lexsn=1:18:00::>bankers</wf>
<punc>.</punc>
</s>
</p>

Second paragraph:

<p pnum=2>
<s snum=2>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Daniel</wf>
<wf cmd=done pos=RB lemma=personally wnsn=1 lexsn=4:02:01::>personally</wf>
<wf cmd=done pos=VB lemma=lead wnsn=7 lexsn=2:41:00::>led</wf>
<punc>.</punc>
</s>
</p>
serendipity
  • One does not simply parse XML with regex, [for these reasons.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mohammad Yusuf Feb 03 '17 at 06:36
  • Iterating over XML using regex would be a bit like trying to get around Midtown Manhattan using a snowboard. Could it be done? Yes, possibly, but much better to use an XML parser here. – Tim Biegeleisen Feb 03 '17 at 06:36
  • @MYGz Yes. I have read this post several times. But like I said, what I need to extract is minimal, so I could make do with regex rather than use specialized XML tools and libraries. – serendipity Feb 03 '17 at 06:38
  • @serendipity Okay. Do you want to extract `<p>....</p>` from a larger text? – Mohammad Yusuf Feb 03 '17 at 06:40
  • @MYGz Yes, just the `<p>....</p>` bit, so that whatever I extract from one paragraph shows up on the same line, the next paragraph's info on the next line, and so on. – serendipity Feb 03 '17 at 06:43

1 Answer

The key is to use a non-greedy (reluctant) quantifier, `.*?`, so that each match grabs the contents of only one paragraph at a time:

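    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // DOTALL lets '.' match line breaks, so one match can span a multi-line paragraph.
    Pattern p = Pattern.compile("<p pnum=([0-9]+)>.*?</p>", Pattern.DOTALL);
    Matcher m = p.matcher(text); // text holds the whole XML input
    while (m.find()) {
        // group(1) is the paragraph number, group(0) the entire <p>...</p> match
        System.out.format("******Paragraph %s*****%n", m.group(1));
        System.out.println(m.group(0));
    }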

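This will, of course, fail if paragraphs are ever nested (a `<p>...</p>` inside another `<p>...</p>`), because the non-greedy match stops at the first `</p>` it finds. That is why regex is not a good choice for parsing XML.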
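
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same loop. It assumes the whole input fits in one `String` and that paragraphs are never nested. The class name `ParagraphSplitter`, the method `splitParagraphs`, and the second capture group for the paragraph body are illustrative additions, not part of the original snippet.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ParagraphSplitter {
    // Group 1 = paragraph number, group 2 = paragraph body (reluctant match).
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p pnum=([0-9]+)>(.*?)</p>", Pattern.DOTALL);

    /** Returns paragraph number -> trimmed paragraph body, in document order. */
    static Map<Integer, String> splitParagraphs(String text) {
        // LinkedHashMap keeps insertion order, i.e. the order of the matches.
        Map<Integer, String> paragraphs = new LinkedHashMap<>();
        Matcher m = PARAGRAPH.matcher(text);
        while (m.find()) {
            paragraphs.put(Integer.parseInt(m.group(1)), m.group(2).trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "<p pnum=1>\n<s snum=1>\n<wf cmd=ignore pos=IN>of</wf>\n</s>\n</p>\n"
                    + "<p pnum=2>\n<s snum=2>\n<punc>.</punc>\n</s>\n</p>\n";
        // Print each paragraph on a single line, in the order it appeared.
        splitParagraphs(text).forEach((num, body) ->
                System.out.println("Paragraph " + num + ": " + body.replace("\n", " ")));
    }
}
```

Because `find()` scans left to right, the matches (and therefore the map entries) come out in the same order as the paragraphs in the input.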

Patrick Parker
  • Thanks for the answer. All I want is to iterate over the paragraph numbers and a nested for loop extracts info from this para like {group=Miller County Democratic Executive Committee, location=Miller County, person=Gov. Price Daniel, other=The Constitution}. Then move on to the next para and extract info from that. The extracting info bit is done. Just need an outer for loop that iterates over paragraphs. Maybe something like pnum=.*? and then I match m.group(0) to the value of an iterator say i? – serendipity Feb 03 '17 at 07:32
  • @serendipity I answered your question about how to iterate over the paragraphs. That is the purpose of the `while` loop in my code. If you have a new question about how to extract data from a paragraph, I feel that should rightly be a separate question. In short, you will need another Matcher object and another `while` loop inside the `while` loop provided here. – Patrick Parker Feb 03 '17 at 07:42
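
The nested-loop idea from the last comment could be sketched roughly as follows: an outer `Matcher` walks the paragraphs, and a second `Matcher` runs over each paragraph body. The inner pattern is only a guess at the extraction step. It assumes that `pn=...` is always the last attribute of a `<wf>` tag, as it is in the sample data. The names `ParagraphEntities`, `ENTITY`, and `describe` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ParagraphEntities {
    // Outer pattern: one paragraph per match (group 1 = pnum, group 2 = body).
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p pnum=([0-9]+)>(.*?)</p>", Pattern.DOTALL);
    // Inner pattern (assumption): pn=... is the last attribute of a <wf> tag.
    private static final Pattern ENTITY =
            Pattern.compile("<wf [^>]*\\bpn=(\\w+)>([^<]*)</wf>");

    /** Builds one output line per paragraph, e.g. "1: person=Gov._Price_Daniel". */
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        Matcher para = PARAGRAPH.matcher(text);
        while (para.find()) {                     // outer loop: paragraphs
            List<String> found = new ArrayList<>();
            Matcher entity = ENTITY.matcher(para.group(2));
            while (entity.find()) {               // inner loop: entities in this paragraph
                found.add(entity.group(1) + "=" + entity.group(2));
            }
            lines.add(para.group(1) + ": " + String.join(", ", found));
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "<p pnum=1>", "<s snum=1>",
            "<wf cmd=ignore pos=IN>of</wf>",
            "<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Gov._Price_Daniel</wf>",
            "</s>", "</p>",
            "<p pnum=2>", "<s snum=2>",
            "<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Daniel</wf>",
            "</s>", "</p>");
        // Prints:
        // 1: person=Gov._Price_Daniel
        // 2: person=Daniel
        describe(text).forEach(System.out::println);
    }
}
```

No explicit paragraph counter is needed: the outer `while (para.find())` loop already visits the paragraphs in order, and `para.group(1)` gives the `pnum` value for each one.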