
Greetings, I've read some threads about this topic, but I was unable to find or think of an adequate solution (see for example: Regular expression to remove XML tags and their content).

I have an XML tag like this:

<bla_tag size="100"
         diameter="50"
         ratio="0.2"
         path="/user/home/something.pdf">
</bla_tag>

Aim: A regular expression that matches the whole opening tag <bla_tag ...>, so that everything between <bla_tag and the closing > can be removed.

Problem: the values like size, etc. change in each of the bla_tags (about 1000 bla-tags in the file).

Failed attempt: I tried <bla_tag .*?> (the ? makes the quantifier non-greedy). Result: only <bla_tag was highlighted, not the full content up to the closing bracket!

What am I doing wrong? Or is it actually impossible to solve this problem with a regex? (I read somewhere that it is not possible because XML is a type-2, i.e. context-free, language. Can you confirm that?)
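One plausible cause, sketched in Python: in many regex engines `.` does not match a newline by default, so `.*?` cannot reach the `>` that sits three lines further down. Whether Kate's search behaves the same way is an assumption here; the snippet only demonstrates the effect with Python's `re` module. The `re.sub` call at the end assumes the goal is to drop the attributes and keep a bare `<bla_tag>`.

```python
import re

sample = '''<bla_tag size="100"
         diameter="50"
         ratio="0.2"
         path="/user/home/something.pdf">
</bla_tag>'''

# '.' does not match '\n' by default, so the lazy pattern finds no match
# in a tag whose attributes are spread over several lines.
lazy_dot = re.search(r'<bla_tag .*?>', sample)

# A negated character class matches anything except '>', newlines included.
negated = re.search(r'<bla_tag\b[^>]*>', sample)

# Alternatively, re.DOTALL lets '.' match newlines as well.
dotall = re.search(r'<bla_tag .*?>', sample, flags=re.DOTALL)

# Assumed goal: remove the attributes, keep the bare opening tag.
stripped = re.sub(r'<bla_tag\b[^>]*>', '<bla_tag>', sample)

print(lazy_dot)
print(negated.group(0) == dotall.group(0))
print(stripped)
```

This only works as long as no attribute value contains a literal `>`; that is one of the cases where a regex breaks down and a parser does not.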

Daniyal
  • See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – RichieHindle Oct 04 '10 at 13:06
  • On what particular programming/scripting language will you apply this? – Ruel Oct 04 '10 at 13:06
  • Your regex looks ok. How do you use it? – Jens Oct 04 '10 at 14:24
  • Jens, I tried typing this regex into the search field of Kate (an editor for Linux), and for some reason it does not work. I also tried 'Scream Editor', but that failed as well. – Daniyal Oct 04 '10 at 15:57

1 Answer


You want to read RegEx match open tags except XHTML self-contained tags

Seriously.

Use an xml parser. (They're not hard to use, honestly). They generally come in one of two flavours - SAX, and DOM, and you're probably going to prefer SAX.

My favorite parser is expat, but each parser has its own subtleties, so no single one fits every case.
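As a rough illustration of the SAX-style approach with expat, here is a minimal Python sketch using the standard library's `xml.parsers.expat` bindings. The function name `strip_bla_tag_attributes` is made up for this example, and it assumes the goal is to drop the attributes of every `bla_tag` while leaving other elements alone. It is not a full serializer: comments, processing instructions and the XML declaration are not copied to the output.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape, quoteattr


def strip_bla_tag_attributes(xml_text, tag="bla_tag"):
    """Re-emit the document, dropping all attributes of `tag` (hypothetical helper)."""
    out = []

    def start(name, attrs):
        if name == tag:
            # Drop every attribute of the target element.
            out.append(f"<{name}>")
        else:
            # Re-emit other elements with their attributes, safely quoted.
            rendered = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            out.append(f"<{name}{rendered}>")

    def end(name):
        out.append(f"</{name}>")

    def chars(data):
        # Text arrives unescaped; escape it again on the way out.
        out.append(escape(data))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return "".join(out)


doc = '<root><bla_tag size="100"\n diameter="50">\n</bla_tag><other a="1">x &amp; y</other></root>'
print(strip_bla_tag_attributes(doc))
```

Because the parser hands over each element with its attributes already decoded, line breaks inside the tag or a `>` inside an attribute value need no special handling.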

Arafangion
  • Thanks a lot, and apologies for the late response. The Automata/Regex part of the linked answer was especially helpful. Thanks to my theoretical computer science lessons, I can now also understand why an XML parser is preferable. – Daniyal Nov 13 '10 at 22:03