Extracting data from XML text

Question

I have a lot of XML data that looks like this:

<contextfile concordance=brown>
<context filename=br-a02 paras=yes>
<p pnum=1>
<s snum=1>
<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>
<wf cmd=done pos=NN lemma=approval wnsn=1 lexsn=1:04:02::>approval</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Gov._Price_Daniel</wf>
<wf cmd=ignore pos=POS>'s</wf>
<punc>``</punc>
<wf cmd=done pos=JJ lemma=abandoned wnsn=1 lexsn=5:00:00:uninhabited:00>abandoned</wf>
<wf cmd=done pos=NN lemma=property wnsn=1 lexsn=1:21:00::>property</wf>
<punc>''</punc>
<wf cmd=done pos=NN lemma=act wnsn=1 lexsn=1:10:01::>act</wf>
<wf cmd=done pos=VB lemma=seem wnsn=1 lexsn=2:39:00::>seemed</wf>
<wf cmd=done pos=JJ lemma=certain wnsn=4 lexsn=3:00:03::>certain</wf>
<wf cmd=done pos=NN lemma=thursday wnsn=1 lexsn=1:28:00::>Thursday</wf>
<wf cmd=ignore pos=IN>despite</wf>
<wf cmd=ignore pos=DT>the</wf>
<wf cmd=done pos=JJ lemma=adamant wnsn=1 lexsn=5:00:00:inflexible:02>adamant</wf>
<wf cmd=done pos=NN lemma=protest wnsn=1 lexsn=1:10:00::>protests</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NN lemma=texas wnsn=1 lexsn=1:15:00::>Texas</wf>
<wf cmd=done pos=NN lemma=banker wnsn=1 lexsn=1:18:00::>bankers</wf>
<punc>.</punc>
</s>
</p>

From this I need to extract the text inside each <wf> and <punc> element to get the output:

Committee approval of Gov. Price Daniel's `` abandoned property '' act seemed certain Thursday despite the adamant protests of Texas bankers .

I have never worked with XML before, so I am a little clueless.

I tried to extract this with some example XML-parsing code I found online, but got this error: Open quote is expected for attribute "concordance" associated with an element type "contextfile". All the files I want to parse start with:

<contextfile concordance=brown>
<context filename=br-a02 paras=yes>

But subsequent data within the file starts with:

<p pnum=2>
<s snum=2>

........
</s>
</p>

Maybe this answer helps: http://stackoverflow.com/questions/10890323/using-sax-with-jaxbcontext - Timothy Truckle, Feb 06 '17 at 07:11
Possible duplicate of [How to read XML using XPath in Java](http://stackoverflow.com/questions/2811001/how-to-read-xml-using-xpath-in-java) - Timothy Truckle, Feb 06 '17 at 07:12
@TimothyTruckle Maybe the questions are similar, but without the presence of a tag how do I extract the words? - serendipity, Feb 06 '17 at 07:29
*"I tried to extract this with some example xml code I found online but got an error saying : Open quote is expected for attribute "concordance" associated with an element type "contextfile""* Your input XML is not *well-formed*. Ask whoever generated it to follow the XML specification. - Timothy Truckle, Feb 06 '17 at 07:35
@TimothyTruckle This data is from the Brown corpus for natural language processing, not something generated internally. - serendipity, Feb 06 '17 at 07:39
*"This data is from the Brown corpus for natural language processing"* Ask them if they have an XSD that defines the XML structure. With it you can use JAXB to generate access classes: http://stackoverflow.com/questions/11463231/how-to-generate-jaxb-classes-from-xsd - Timothy Truckle, Feb 06 '17 at 07:43
Your XML is not well-formed. Attribute values such as `concordance=brown` must be quoted, like `concordance="brown"`. Fix your XML first. - SomeDude, Feb 06 '17 at 19:21
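
A minimal sketch of the quoting fix suggested in the comments above. It is written in Python only to keep it short (the comments point to Java tooling, so treat it as an illustration of the idea, not a drop-in). It assumes that unquoted attribute values never contain whitespace, quotes, or `>`, which holds for the sample shown; `quote_attributes` is a hypothetical helper name, not part of any library.

```python
import re

# A whole tag: '<', anything except angle brackets, then '>'.
TAG = re.compile(r'<[^<>]+>')
# name=value where the value is unquoted (assumed: no whitespace, quotes, or '>').
UNQUOTED_ATTR = re.compile(r'''([\w:.-]+)=([^\s"'>]+)''')


def quote_attributes(text: str) -> str:
    """Wrap unquoted attribute values in double quotes, touching only tags."""
    def fix_tag(match: re.Match) -> str:
        return UNQUOTED_ATTR.sub(r'\1="\2"', match.group(0))
    return TAG.sub(fix_tag, text)


line = '<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>'
print(quote_attributes(line))
# <wf cmd="done" pos="NN" lemma="committee" wnsn="1" lexsn="1:14:00::">Committee</wf>
```

Restricting the substitution to tags leaves element text untouched. Other well-formedness problems, such as unescaped `&` characters, are not handled by this sketch.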

kieraf · Answer 1 · 2017-02-07T16:00:19.530

This seems to work (it needs XPath 2.0 or later, since `string-join` does not exist in XPath 1.0):

//s[@snum]/string-join(wf | punc, " ")

I verified it on http://xpather.com/Va3jRPr4 (an online XPath tester), using the content of your example <p> element twice. You can drag the horizontal bar down a bit to see the whole result.
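
For readers without an XPath 2.0 engine, here is a rough equivalent of the expression above as a Python sketch using the standard library's `xml.etree.ElementTree`. It assumes the attribute values have already been quoted and that the file closes `<context>` and `<contextfile>` (the excerpt in the question does not show this, so the closing tags in the shortened sample are an assumption). `sentences` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the sample from the question.
sample = """<contextfile concordance="brown">
<context filename="br-a02" paras="yes">
<p pnum="1">
<s snum="1">
<wf cmd="done" pos="NN" lemma="committee" wnsn="1" lexsn="1:14:00::">Committee</wf>
<wf cmd="done" pos="NN" lemma="approval" wnsn="1" lexsn="1:04:02::">approval</wf>
<wf cmd="ignore" pos="IN">of</wf>
<punc>.</punc>
</s>
</p>
</context>
</contextfile>"""


def sentences(xml_text: str) -> list[str]:
    """Return one space-joined string per <s snum=...> element."""
    root = ET.fromstring(xml_text)
    result = []
    for s in root.iter("s"):
        if "snum" not in s.attrib:
            continue
        # Direct <wf> and <punc> children, in document order.
        tokens = [child.text or "" for child in s if child.tag in ("wf", "punc")]
        result.append(" ".join(tokens))
    return result


print(sentences(sample))
# ['Committee approval of .']
```

Tokens are joined exactly as they appear, so underscores in entries such as `Gov._Price_Daniel` are kept.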

edited Feb 07 '17 at 16:00

answered Feb 06 '17 at 21:46
