I have an XML file, and I need to match the < and > characters that appear in the text inside the tags and replace them, but I am having trouble matching them.

The XML looks something like this:

<tag>text</tag>
<tag2>3 is > than 2</tag2>
<tag3>But 1 in < than 4</tag3>

I found a solution using this regex

(\s>\s|\s<\s) 

It matches a whitespace character, then the symbol, then another whitespace character. But what if there is no whitespace around the symbol?

Edit: In fact, I need to replace these symbols with &lt; and &gt;. The XML comes from third-party software, which produces an output file like the one shown above.

I know the best approach would be for the software to encode < and > as &lt; and &gt; when it reads the data and writes the XML, but I hoped there was a way to do it afterwards.

dvoran
  • 31
  • 6
  • 1
    Use an XML parser, not regex to do what you want. [Because this.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – brandonscript Jun 09 '15 at 16:20
  • 1
    The snippet `But 1 in < than 4` is not XML, the `<` needs to be escaped as `&lt;`. – Martin Honnen Jun 09 '15 at 16:24
  • What you've posted isn't valid XML - those angle brackets would be encoded as `&lt;` and `&gt;` in actual XML. Are you trying to set your regex in an XSD to limit what is valid for the element? – Dan Field Jun 09 '15 at 16:24
  • OK, you've got a problem. It's not legal XML, so you can't use an XML parser, but it's pretty tough to distinguish the XML angle brackets from the non-XML angle brackets any other way. (That's why XML wants you to escape them, after all.) You ask the question, what if there aren't any whitespaces around the angle brackets? Indeed, what if there's a "<" followed by "title>this is not a title". You're asking us to guess what input you have to deal with, and to guess how you want to deal with it. This is not programming, it is quackery. Have a talk to your data suppliers, and get them to use XML – Michael Kay Jun 09 '15 at 19:35

1 Answer

So basically you are receiving malformed XML, and you want to replace the < and > in the content with &lt; and &gt;.

Bad news: it is not possible to do this with a regex in a way that works for XML in general. Try building a parser.

Good news: if you introduce some limitations (i.e. if the data you receive complies with some requirements), there may be some good solutions.

You need a way to distinguish which symbols are part of the tags, and which symbols are part of the content.
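
To make that distinction concrete, here is a minimal Python sketch (Python is my choice here, the question does not name a language). It assumes that a tag is always `<name>` or `</name>` with a purely alphanumeric name and no attributes; the names `TAG` and `escape_stray_brackets` are made up for the illustration. Anything matching that shape is kept as markup, and every other < or > is treated as content and escaped.

```python
import re

# Assumption: a tag is "<" or "</", then letters/digits only, then ">".
# Attributes, namespaces, comments, CDATA etc. are NOT handled.
TAG = re.compile(r"(</?[A-Za-z0-9]+>)")

def escape_stray_brackets(text):
    # Splitting on a pattern with one capturing group keeps the delimiters:
    # even indices are the text between tags, odd indices are the tags.
    parts = TAG.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # a tag: leave untouched
        else:
            # content: escape whatever angle brackets are left
            out.append(part.replace("<", "&lt;").replace(">", "&gt;"))
    return "".join(out)

if __name__ == "__main__":
    print(escape_stray_brackets("<tag2>3 is > than 2</tag2>"))
    print(escape_stray_brackets("<tag3>But 1 in < than 4</tag3>"))
```

This does not depend on whitespace around the symbols, but it will still guess wrong whenever the content itself happens to look like a tag, which is exactly the ambiguity raised in the comments.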

For example, if you assume that tag names contain only letters and digits, with no spaces (or other symbols) in between, something like

(?<lt><)(?:(?!\/?[[:alnum:]]*>))|(?:\s[[:alnum:]]*)(?<gt>>)

could probably work. You can play with it at https://regex101.com/r/uF0iR2/2

It is the alternation (|) of two patterns. The first one matches a < that is not followed by the rest of a tag. The second one matches a > that is preceded by whitespace and an optional run of letters or digits. We could drop the negative lookahead (?!...), but then the first pattern could collide with the second. We cannot use a negative look-behind for the second one because a look-behind cannot contain quantifiers.
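
If you want to try the same idea from Python, the pattern needs two adaptations, so the following is a port under my own assumptions rather than the exact regex above: Python's `re` writes named groups as `(?P<name>...)` and has no POSIX classes, so `[[:alnum:]]` is replaced by `[A-Za-z0-9]`. The helper names are hypothetical. Because the second branch also consumes the text before the >, the replacement is done with a function that puts that prefix back.

```python
import re

# Python port of the pattern discussed above (an adaptation, see lead-in).
STRAY = re.compile(
    r"(?P<lt><)(?!/?[A-Za-z0-9]*>)"    # "<" not followed by the rest of a tag
    r"|(?:\s[A-Za-z0-9]*)(?P<gt>>)"    # ">" preceded by whitespace + alnum run
)

def _escape(match):
    if match.group("lt") is not None:
        return "&lt;"
    # Second branch: keep the consumed prefix, replace only the final ">".
    return match.group(0)[:-1] + "&gt;"

def escape_with_regex(text):
    return STRAY.sub(_escape, text)

if __name__ == "__main__":
    print(escape_with_regex("<tag2>3 is > than 2</tag2>"))
    print(escape_with_regex("<tag3>But 1 in < than 4</tag3>"))
```

Note the limitation that follows from the second branch: a > with no whitespace anywhere before it in the same word, as in `3>2`, is not matched and stays unescaped.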

Finally, and unrelated to the above: a shorter equivalent of (\s>\s|\s<\s) is (\s[<>]\s)
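
A quick way to convince yourself that the two forms behave the same is to run both over a few strings. This is only a spot check in Python on sample inputs I made up, not a proof.

```python
import re

LONG_FORM = re.compile(r"(\s>\s|\s<\s)")
SHORT_FORM = re.compile(r"(\s[<>]\s)")

# Made-up sample inputs: with whitespace, without, and tag-like text.
SAMPLES = ["3 is > than 2", "But 1 in < than 4", "3>2", "a <b> c", "x < y > z"]

def same_matches(text):
    # Compare every match found by each pattern on the same input.
    return LONG_FORM.findall(text) == SHORT_FORM.findall(text)

if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), same_matches(sample))
```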

Gonfva
  • 1,428
  • 16
  • 28