RegEx/Python: n - occurrences of match before other match

Question

I have an XML file structured like this:

    <word id="15" pos="SS">
        <token>infarto</token>
        <lemmas>infarto</lemmas>
    </word>
    <word id="16" pos="AS">
        <token>miocardico</token>
        <lemmas>miocardico</lemmas>
    </word>
    <word id="17" pos="AS" annotated="head">
        <token>acuto</token>
        <lemmas>acuto</lemmas>
    </word>
    <word id="18" pos="E">
        <token>in</token>
        <lemmas>in</lemmas>
    </word>
    <word id="19" pos="SS">
        <token>corso</token>
        <lemmas>corso</lemmas>
    </word>

What I'm trying to do is get the values of "pos" and "token" for the words surrounding the one with word id 17 (the one with annotated="head").

This is no problem for the matches coming after word 17:

(pos=")(.+)(")(\s\S+?)("head")([\s\S]+?)(>)(\w+?)(<+)([\S\s]+?)(pos=")(.+)(")([\s\S]+?)    (token>)(.+)(<)([\s\S]+?)

This gets me all the information I want and if I want to expand I can just add

(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)

to the end. It isn't pretty, but it works.

Now, when I want to go in the other direction, I'm absolutely stumped:

(pos=")(.+)(")([\s\S]+?)(token>)(.+)(<)([\s\S]+?)(pos=")(.+)(")(\s\S+?)("head")

Instead of matching only the information of word 16 (the first one in front of annotated="head"), it matches all the information that comes before it (word 15, word 14, word 13, etc.).

What am I missing?

P.S. Using an XML parser is sadly not an option.

You should be using an XML library for this type of task, not regexes. — armandino, Aug 07 '12 at 09:21
you should not be using regular expressions for html or xml. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Inbar Rose, Aug 07 '12 at 09:22
http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la — Dmitry Zagorulkin, Aug 07 '12 at 09:24
[Use an XML parser](http://stackoverflow.com/a/1732454/647772) — , Aug 07 '12 at 09:35
Yes, I am well aware of that. Sadly using one of those is really not an option at the moment. :( — lhausmann, Aug 07 '12 at 10:30

score 0 · Answer 1 · answered Aug 07 '12 at 12:25

If you have made sure your data is well-formed XML, I think it is possible. Try these steps:

Step 1: <word[^>]*>([^<]*(?:(?!<\/?word)<[^<]*)*)<\/word> (ref http://regexr.com?31org)
Step 2: take the string captured by group 1 in step 1 and match it against <token[^>]*>([^<]*(?:(?!<\/?token)<[^<]*)*)<\/token> (ref http://regexr.com?31ora) or <lemmas[^>]*>([^<]*(?:(?!<\/?lemmas)<[^<]*)*)<\/lemmas> (ref http://regexr.com?31ord)

You could try to modify these patterns for your requirement :)

Reference: Mastering Regular Expressions, 3rd edition
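The two steps above could be wired together in Python roughly as follows. This is only a sketch under a few assumptions: the step 1 pattern is modified to also capture the attribute text of the opening tag (so "pos" and "annotated" can be read), and the names SAMPLE, WORD_RE, TOKEN_RE, POS_RE and neighbours are made up for illustration. Because every word block is collected first, taking n words before or after the head becomes a list slice instead of a longer regex.

```python
import re

# Sample data taken from the question, with indentation trimmed.
SAMPLE = """\
<word id="15" pos="SS">
<token>infarto</token>
<lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
<token>miocardico</token>
<lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
<token>acuto</token>
<lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
<token>in</token>
<lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
<token>corso</token>
<lemmas>corso</lemmas>
</word>
"""

# Step 1 pattern, with one assumed change: group 1 now captures the
# attributes of the opening tag, group 2 captures the body of the block.
WORD_RE = re.compile(r'<word([^>]*)>([^<]*(?:(?!</?word)<[^<]*)*)</word>')
# Step 2 pattern, unchanged apart from dropping the escaped slashes.
TOKEN_RE = re.compile(r'<token[^>]*>([^<]*(?:(?!</?token)<[^<]*)*)</token>')
# Helper pattern (an assumption, not part of the answer) to read "pos".
POS_RE = re.compile(r'pos="([^"]*)"')


def describe(word):
    """Turn one (attributes, body) pair into a (pos, token) tuple."""
    attrs, body = word
    pos = POS_RE.search(attrs)
    token = TOKEN_RE.search(body)
    return (pos.group(1) if pos else None,
            token.group(1) if token else None)


def neighbours(xml, n=1):
    """Return (before, after): up to n words on each side of the head."""
    words = WORD_RE.findall(xml)  # list of (attributes, body) pairs
    head = next((i for i, (attrs, _) in enumerate(words)
                 if 'annotated="head"' in attrs), None)
    if head is None:
        return [], []
    before = [describe(w) for w in words[max(0, head - n):head]]
    after = [describe(w) for w in words[head + 1:head + 1 + n]]
    return before, after


before, after = neighbours(SAMPLE, n=2)
print(before)
print(after)
```

Like the patterns it is built on, this only holds up if the input is well-formed and "word" elements are never nested.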

score 0 · Accepted Answer · answered Aug 07 '12 at 18:04

I think it should be something like this:

(?s)(<word(?:(?!<word).)*)<word[^>]*?annotated="head".*?(<word[^>](?:(?<!</word>).)*)

As a result, group #1 will contain the "word" node with id = 16 and group #2 will contain the "word" node with id = 18.

Then you can parse each of these nodes separately using a regex like the following:
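To try the pattern above in Python, something along these lines should work. It is a minimal sketch: the sample string is the data from the question with indentation trimmed, the names SAMPLE and NEIGHBOURS_RE are illustrative, and the inline (?s) flag is passed as re.DOTALL instead.

```python
import re

# Sample data taken from the question, with indentation trimmed.
SAMPLE = """\
<word id="15" pos="SS">
<token>infarto</token>
<lemmas>infarto</lemmas>
</word>
<word id="16" pos="AS">
<token>miocardico</token>
<lemmas>miocardico</lemmas>
</word>
<word id="17" pos="AS" annotated="head">
<token>acuto</token>
<lemmas>acuto</lemmas>
</word>
<word id="18" pos="E">
<token>in</token>
<lemmas>in</lemmas>
</word>
<word id="19" pos="SS">
<token>corso</token>
<lemmas>corso</lemmas>
</word>
"""

# Same pattern as in the answer; re.DOTALL replaces the leading (?s).
# Group 1 cannot run past the next "<word", so it holds only the node
# directly in front of the head. Group 2 stops right after "</word>".
NEIGHBOURS_RE = re.compile(
    r'(<word(?:(?!<word).)*)'
    r'<word[^>]*?annotated="head".*?'
    r'(<word[^>](?:(?<!</word>).)*)',
    re.DOTALL,
)

match = NEIGHBOURS_RE.search(SAMPLE)
previous_word = match.group(1)  # node with id 16 (plus trailing whitespace)
next_word = match.group(2)      # node with id 18

print(previous_word)
print(next_word)
```

The tempered part (?:(?!<word).)* is what keeps group 1 from swallowing words 15, 14 and so on: it refuses to step over another opening <word, so the match can only begin at the node immediately before the head.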

(?s)<word[^>]*?pos="(?P<pos>[^"]+).*?<token>(?P<token>[^<]+)

and you will get two groups 'pos' and 'token'.
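As a quick check, the node-level pattern can be applied to one captured node like this. The node string and the names NODE_RE and node are assumptions for illustration; the pattern itself is the one shown above, with (?s) again expressed as re.DOTALL.

```python
import re

# Same pattern as in the answer, with named groups "pos" and "token".
NODE_RE = re.compile(
    r'<word[^>]*?pos="(?P<pos>[^"]+).*?<token>(?P<token>[^<]+)',
    re.DOTALL,
)

# One node as it would come out of group 1 of the previous pattern.
node = (
    '<word id="16" pos="AS">\n'
    '<token>miocardico</token>\n'
    '<lemmas>miocardico</lemmas>\n'
    '</word>'
)

match = NODE_RE.search(node)
info = match.groupdict()  # dictionary keyed by the group names
print(info["pos"], info["token"])
```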

Of course, a single regex could be used, but it would be pretty ugly.
