
I have an XML file structured like this:

<word id="15" pos="SS">
  <token>infarto</token>
  <lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
  <token>miocardico</token>
  <lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
  <token>acuto</token>
  <lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
  <token>in</token>
  <lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
  <token>corso</token>
  <lemmas>corso</lemmas>
</word>

What I'm trying to do is get the values of "pos" and "token" for the words surrounding the one with word id 17 (the one with annotated="head").

This is no problem for the matches that come after word 17:

(pos=")(.+)(")(\s\S+?)("head")([\s\S]+?)(>)(\w+?)(<+)([\S\s]+?)(pos=")(.+)(")([\s\S]+?)    (token>)(.+)(<)([\s\S]+?)

This gets me all the information I want, and if I want to include further words I can just add

(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)

to the end. It isn't pretty, but it works.

But when I try to go in the other direction, I'm absolutely stumped:

(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)(pos=")(.+)(")(\s\S+?)("head")

Instead of matching only the information of word 16 (the first one before annotated="head"), it matches all the information that comes before it (word 15, word 14, word 13, etc.).

What am I missing?

P.S. Using an XML parser is sadly not an option.

Jonathan Leffler
lhausmann

2 Answers


If you have made sure your data is well-formed XML, I think it is possible. Try these steps:

Step 1: <word[^>]*>([^<]*(?:(?!<\/?word)<[^<]*)*)<\/word> (ref http://regexr.com?31org)
Step 2: take the string captured in step 1 (group 1) and match it against <token[^>]*>([^<]*(?:(?!<\/?token)<[^<]*)*)<\/token> (ref http://regexr.com?31ora) or <lemmas[^>]*>([^<]*(?:(?!<\/?lemmas)<[^<]*)*)<\/lemmas> (ref http://regexr.com?31ord)

You could modify these patterns to fit your requirements :)
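
The two steps could be wired together roughly as follows. This is only a sketch in Python (the question does not say which regex engine is in use), the sample data is copied from the question, and the names `WORD_RE`, `TOKEN_RE` and `tokens` are made up for illustration:

```python
import re

# Sample data from the question (indentation normalised).
xml = """
<word id="15" pos="SS">
  <token>infarto</token>
  <lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
  <token>miocardico</token>
  <lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
  <token>acuto</token>
  <lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
  <token>in</token>
  <lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
  <token>corso</token>
  <lemmas>corso</lemmas>
</word>
"""

# Step 1: capture the inner content of each <word> element (group 1).
WORD_RE = re.compile(r'<word[^>]*>([^<]*(?:(?!</?word)<[^<]*)*)</word>')
# Step 2: capture the text of the <token> element inside that content.
TOKEN_RE = re.compile(r'<token[^>]*>([^<]*(?:(?!</?token)<[^<]*)*)</token>')

tokens = []
for word_match in WORD_RE.finditer(xml):
    inner = word_match.group(1)
    token_match = TOKEN_RE.search(inner)
    if token_match:
        tokens.append(token_match.group(1))

print(tokens)
```

Because each `<word>` is matched on its own, a pattern applied in step 2 cannot run past the end of that element, which is what keeps the match from spilling over into earlier or later words.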

Reference: Mastering Regular Expressions, 3rd edition

godspeedlee

I think it should be something like this:

(?s)(<word(?:(?!<word).)*)<word[^>]*?annotated="head".*?(<word[^>](?:(?<!</word>).)*)

As a result, group #1 will contain the "word" node with id = 16 and group #2 will contain the "word" node with id = 18.

Then you can parse each of these nodes separately using regex like the following:

(?s)<word[^>]*?pos="(?P<pos>[^"]+).*?<token>(?P<token>[^<]+)

and you will get two groups 'pos' and 'token'.
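
For what it's worth, here is how the two patterns might be combined. It is a sketch that assumes Python's `re` module (the `(?P<name>...)` group syntax fits it) and uses the sample data from the question; `NEIGHBOURS_RE`, `NODE_RE` and `results` are illustrative names only:

```python
import re

# Sample data from the question (indentation normalised).
xml = """
<word id="15" pos="SS">
  <token>infarto</token>
  <lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
  <token>miocardico</token>
  <lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
  <token>acuto</token>
  <lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
  <token>in</token>
  <lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
  <token>corso</token>
  <lemmas>corso</lemmas>
</word>
"""

# Group 1: the <word> node right before the head; group 2: the one right after.
NEIGHBOURS_RE = re.compile(
    r'(?s)(<word(?:(?!<word).)*)<word[^>]*?annotated="head".*?'
    r'(<word[^>](?:(?<!</word>).)*)'
)
# Pulls "pos" and "token" out of a single <word> node.
NODE_RE = re.compile(r'(?s)<word[^>]*?pos="(?P<pos>[^"]+).*?<token>(?P<token>[^<]+)')

results = []
neighbours = NEIGHBOURS_RE.search(xml)
if neighbours:
    for node in neighbours.groups():
        details = NODE_RE.search(node)
        if details:
            results.append((details.group('pos'), details.group('token')))

print(results)  # [('AS', 'miocardico'), ('E', 'in')]
```

The `(?:(?!<word).)*` part is what stops group 1 from reaching back past word 16: it refuses to step over another `<word`, so the match can only start at the node immediately before the head.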

Of course, a single regex could be used, but it would be pretty ugly.

Teddy Bo