10

I'm trying to write a regular expression using the PCRE library in PHP.

I need a regex that matches only the &, > and < chars that occur within the text content of any XML node, and not those in the tag declarations themselves.

Input XML:

<pnode>
  <cnode>This string contains > and < and & chars.</cnode>
</pnode>

The idea is to do a search and replace on these chars and convert them to their XML entity equivalents.

If I were to convert the entire XML to entities, it would look like this:

Entire XML converted to entities

&lt;pnode&gt;
  &lt;cnode&gt;This string contains &gt; and &lt; and &amp; chars.&lt;/cnode&gt;
&lt;/pnode&gt;

I need it to look like this:

Correct XML

<pnode>
  <cnode>This string contains &gt; and &lt; and &amp; chars.</cnode>
</pnode>

I have tried to write a regular expression to match these chars using lookahead, but I don't know enough to get this to work. My attempt (currently only attempting to match > symbols):

/>(?=[^<]*<)/g

Just to make it clear: the XML I'm trying to fix comes from a 3rd party, and they seem unable to fix it at their end, hence my attempt to fix it.

Alan Moore
Camsoft
  • @Rowland, while I agree with you, that's exactly his point: he wants to take the input and make it into valid XML by escaping the >, < and & characters. – Lazarus Feb 17 '10 at 16:55
  • Unless you have a schema defined, how could you possibly know that any given < is not the beginning of a tag? – John M Gant Feb 17 '10 at 16:59
  • Why do you have invalid XML to start with? Is it possible to avoid generating malformed XML rather than try to fix it up after the fact? – John Kugelman Feb 17 '10 at 17:00
  • @Camsoft, have you tried http://regexlib.com as a resource for this kind of thing? It might provide some clues, if not the final solution. – Lazarus Feb 17 '10 at 17:02
  • @jmgant, that's a good point. If you assume that the nodes only have either text or child nodes between them then by matching tag pairs you could identify the text that needs the substitutions. – Lazarus Feb 17 '10 at 17:03
  • `s||<![CDATA[|g`, `s||]]>|g`. – kennytm Feb 17 '10 at 17:04
  • @John Kugelman, that's usually my first response and probably the most valid one. Fixing the problem this way is a kludge at best; we should always try to solve the problem at its source. +1 for that. – Lazarus Feb 17 '10 at 17:05
  • @jmgant Indeed. There is no schema with this so-called XML feed. It's worth noting that I get the XML feed from a 3rd party and have no control over its data. I was thinking it might be possible to write a crude regex that, when it finds a matching char, makes sure that a tag of the same name exists before and after it (i.e. that the char is enclosed). – Camsoft Feb 17 '10 at 17:08
  • @Lazarus Thanks for that, I'm looking in to it now. – Camsoft Feb 17 '10 at 17:17
  • @Camsoft, "It's worth noting that I get the XML feed from a 3rd party and have no control over its data." No, you get a data feed. It's not an XML feed. If your 3rd party says it is, he's selling defective goods. – LarsH Sep 15 '10 at 18:39

7 Answers

3

In the end I've opted to use the Tidy library in PHP. The code I used is shown below:

  // Specify configuration
  $config = array(
    'input-xml'  => true,
    'show-warnings' => false,
    'numeric-entities' => true,
    'output-xml' => true);

  // Parse the feed (read as Latin-1) and repair it
  $tidy = new tidy();
  $tidy->parseFile('feed.xml', $config, 'latin1');
  $tidy->cleanRepair();

This works perfectly, correcting all the encoding errors and converting invalid characters to XML entities.

Camsoft
  • Don't forget to accept your answer. Even though it's your own, it has answered your question and will save others trawling through the other answers. – Lazarus Feb 18 '10 at 09:51
2

Classic example of garbage in, garbage out. The real solution is to fix the broken XML exporter, but obviously that's out of the scope of your problem. Sounds like you might have to manually parse the XML, run htmlentities() on the contents, then put the XML tags back.
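
A rough sketch of that "split, escape, reassemble" idea, not a definitive implementation. It assumes that anything shaped like <name ...>, </name> or <name/> really is a tag, which is exactly the guess the malformed input forces on you. The function name is made up for illustration, and comments, processing instructions and DOCTYPE declarations are not recognised as markup here, so they would be escaped too. It uses plain replacements rather than htmlentities() so that only the three XML characters are touched.

```php
<?php
// Hypothetical helper: escape &, < and > in everything that does not look like a tag.
function escapeTextBetweenTags(string $xml): string
{
    // Assumption: anything shaped like <name ...>, </name> or <name/> is markup.
    $tagPattern = '~(</?[A-Za-z_][\w:.-]*(?:\s[^<>]*)?/?>)~';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split($tagPattern, $xml, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // odd indexes hold the captured tags; leave them alone
        }
        // Escape "&" unless it already starts an entity reference, then the angle brackets.
        $part = preg_replace('/&(?![A-Za-z_][\w.-]*;|#\d+;|#x[0-9A-Fa-f]+;)/', '&amp;', $part);
        $parts[$i] = str_replace(['<', '>'], ['&lt;', '&gt;'], $part);
    }

    return implode('', $parts);
}
```

Calling escapeTextBetweenTags() on the sample from the question leaves the pnode and cnode tags alone and escapes the three characters inside the text. Text that happens to look like a tag (the <br/> case raised in another answer) would still be left untouched.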

TravisO
  • or htmlspecialchars() if you just want to convert the mentioned characters. – jeroen Feb 17 '10 at 17:01
  • The XML is provided from a 3rd party and I have no control over the data. Also there are fewer character entities in XML than PHP so htmlentities() would over entitise! ;-) – Camsoft Feb 17 '10 at 17:03
  • Problem with parsing it as an object is that the actual XML document I want to fix is 5MB and 42,000 lines. I hoped that a regex would quickly search and replace the invalid chars. – Camsoft Feb 17 '10 at 17:14
2

I'm reasonably certain it's simply not possible. You need something that keeps track of nesting, and there's no way to get a regular expression to track nesting. Your choices are to fix the text first (when you probably can use an RE) or use something that's at least vaguely like an XML parser, specifically to the extent of keeping track of how the tags are nested.

There's a reason XML demands that these characters be escaped though -- without that, you can only guess about whether something is really a tag or not. For example, given something like:

    <tag>Text containing < and > characters</tag>

you and I can probably guess that the result should be: ...containing &lt; and &gt;... but I'm pretty sure the XML specification allows the extra whitespace, so officially "< and >" should be treated as a tag. You could, I suppose, assume that anything that looks like an un-matched tag really isn't intended to be a tag, but that's going to take some work too.

Jerry Coffin
  • Yeah, I was starting to think that. The more I look at the problem, the more complicated it seems to get. I would just love to be able to avoid using an XML parser, as it's a huge XML file I'm trying to fix. – Camsoft Feb 17 '10 at 17:22
0

Would it be possible to intercept the text before it tries to become part of your XML? A few ounces of prevention might be worth pounds of cure.

No Refunds No Returns
  • I'm not the author of the XML, I'm just the one trying to use it. – Camsoft Feb 17 '10 at 17:12
  • @Camsoft would you fill up your car with gas that will break the engine? If the answer to that is No, then why do you want to use broken XML? Tell the provider to fix it. – Gordon Jun 03 '11 at 09:51
  • @Gordon, thanks for your reply, though this question was asked over a year ago! The provider refused to fix it. Clearly I tried, but there is not much more I can do other than attempt to fix it myself. – Camsoft Jun 05 '11 at 12:59
0

This should do it for ampersands:

/(\s+)(&)(\s+)/gim

This means you're only looking for those characters when they have whitespace characters on both sides.

Just make sure the replacement expression is "$1$2amp;$3";

The others would go like this, with their replacement expressions on the right:

/(\s+)(>)(\s+)/gim   "$1&gt;$3"
/(\s+)(<)(\s+)/gim   "$1&lt;$3"
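
For PHP specifically, here is a minimal sketch of the same whitespace-on-both-sides idea. Note that preg_replace() replaces every match by default and PCRE in PHP has no g flag. This variant swaps the capture groups for lookarounds, so the surrounding whitespace is not consumed and two such characters separated by a single space are both caught. The function name is only illustrative.

```php
<?php
// Hypothetical helper: escape &, < and > only when they have whitespace on both sides.
function escapeSpacedChars(string $xml): string
{
    // Lookbehind/lookahead check for whitespace without consuming it,
    // so adjacent matches such as "a & & b" are each replaced.
    $result = preg_replace(
        ['/(?<=\s)&(?=\s)/', '/(?<=\s)<(?=\s)/', '/(?<=\s)>(?=\s)/'],
        ['&amp;', '&lt;', '&gt;'],
        $xml
    );

    // preg_replace() returns null on a PCRE error; fall back to the input in that case.
    return $result ?? $xml;
}
```

This only catches characters that stand alone between spaces; something like "a<b" or "AT&T" slips through untouched.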
Robusto
0

As stated by others, regular expressions don't do well with hierarchical data. Besides, if the data is improperly formatted, you can't guarantee that you'll get it right. Consider:

<xml>
    <tag>Something<br/>Something Else</tag>
</xml>

Is that <br/> supposed to read &lt;br/&gt;? There's no way to know because it's validly formatted XML.

If you have arbitrary data that you wish to include in your XML tree, consider using a <![CDATA[ ... ]]> block instead. It's treated the same as a text node, and the only thing you have to watch out for is the character sequence ]]>, which would end the block early.
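
For whoever generates the XML, a small sketch of wrapping arbitrary text in a CDATA section. The function name is made up for illustration. The one sequence that cannot appear inside the block, ]]>, is handled by the usual trick of closing the section in the middle of it and opening a new one.

```php
<?php
// Hypothetical helper: wrap arbitrary text in a CDATA section.
function cdataWrap(string $text): string
{
    // Split any "]]>" across two CDATA sections: "]]" ends the first, ">" starts the next.
    $safe = str_replace(']]>', ']]]]><![CDATA[>', $text);

    return '<![CDATA[' . $safe . ']]>';
}
```

A parser reads the adjacent sections back as one run of text, so the original string comes out unchanged.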

MightyE
0

What you have there is not, of course, XML. In XML, the characters '<' and '&' may not occur (unescaped) inside text: only inside a comment, CDATA section, or processing instruction. Actually, '>' can occur in text, except as part of the string ']]>'. In well-formed XML, literal '<' and '&' characters signal the start of markup: '<' signals the start of a start tag, end tag, or empty element tag, and '&' signals the start of an entity reference. In both these cases, the next character may NOT be whitespace. So using an RE like Robusto's suggestion would find all such occurrences. You might also need to catch corner cases like '<<', '<\', or '&<'. In this case you don't need to try to parse your input; an RE will work fine.

If the source contains strings like '<something ' where 'something' matches the production for a Name:

Name ::= NameStartChar (NameChar)*

Then you have more of a problem. You are going to have to (try to) parse your input as if it were real XML, and detect the error cases of malformed Names, non-matching start & end tags, malformed attributes, and undefined entity references (to name a few). Unfortunately the error condition isn't guaranteed to happen at the location of the error.
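
To see where a parser gives up, something like the following sketch could be used. It relies on PHP's libxml functions; the helper name is made up for illustration. As noted above, the reported position is where the parser noticed the problem, which is not necessarily where the bad character is.

```php
<?php
// Hypothetical helper: return libxml's error messages for a string, or [] if it parses.
function findXmlErrors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $messages;
}
```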

Your best bet may be to use an RE to catch 90% of the errors and fix the rest manually. You need to look for a '<' or '&' followed by anything other than a NameStartChar.
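
A minimal PHP sketch of that RE, under a couple of assumptions: NameStartChar is approximated by its ASCII subset (letters, '_' and ':'), and a '<' followed by '/', '!' or '?' (end tags, comments/CDATA/DOCTYPE, processing instructions) or an '&' followed by '#' (character references) is treated as legitimate markup. The function name is only illustrative.

```php
<?php
// Hypothetical helper: escape "<" and "&" that cannot start markup.
function escapeNonMarkup(string $xml): string
{
    $result = preg_replace(
        [
            '/&(?![A-Za-z_:#])/',     // "&" not followed by a name start or "#"
            '/<(?![A-Za-z_:\/!?])/',  // "<" not followed by a name start, "/", "!" or "?"
        ],
        ['&amp;', '&lt;'],
        $xml
    );

    // preg_replace() returns null on a PCRE error; fall back to the input in that case.
    return $result ?? $xml;
}
```

This also covers the '<<' and '&<' corner cases, but anything like '<something ' in running text still looks like a tag and is left for the manual pass.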

Max