
I'm in a real hurry right now, and I'm begging regex masters for help! I'm receiving XML through an HTTP request, and I can't parse it because it contains special characters that are not wrapped in CDATA sections.

example XML:

<root>
    <node>good node</node>
    <node>bad node containing &</node>
</root>

Trying to parse this XML with simplexml_load_string($xml) I get:

Warning: simplexml_load_string() [function.simplexml-load-string]:
Entity: line 3: parser error : xmlParseEntityRef: no name in /..../file.php on line ##

Supposing that the bad nodes will not contain > or <, I need a regex that will wrap the text of those nodes in CDATA sections. I guess it will involve some lookarounds, but I can't work it out quickly.

Thank you!

s3v3n
  • Easy: `$result = "<![CDATA[" . $get_file_contents() . "]]>";` No need for a regex! – Kerrek SB Nov 17 '11 at 15:27
  • So, you don't have any way to get that "XML" (read: "INVALID XML") to have encoded entities? – Code Jockey Nov 17 '11 at 15:29
  • Unfortunately I have no access to those computers, so I can't do anything to get it right for the moment – s3v3n Nov 17 '11 at 15:37
  • @Kerrek: I should wrap the contents of each terminal non-empty node. Your solution will return me the entire xml tree as text - impossible to parse – s3v3n Nov 17 '11 at 15:39
  • @s3v3n: Would it be an option *only* to find stray ampersands and replace them by an entity reference? – Kerrek SB Nov 17 '11 at 15:49
  • @KerrekSB certainly seems feasible from what I read, and would remedy the mixed-ampersands-and-entities situation that might arise at certain times... – Code Jockey Nov 17 '11 at 15:53
  • @KerrekSB The problem is I'm not sure the only special chars will be the ampersands, so I preferred a REGEX solution. The ampersand was only an example and was the first problem I met. – s3v3n Nov 17 '11 at 16:32
  • @s3v3n: In that case I don't think you'll get a truly correct solution unless you run some sort of validating parser that can recover from errors and handle the erroneous section. – Kerrek SB Nov 17 '11 at 16:36
  • Or accept it as a temporary solution, and ask those guys to give me valid XML as quickly as possible. :) Thanks anyway! – s3v3n Nov 17 '11 at 16:48
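
The alternative raised in the comments above, escaping only the stray ampersands, might look roughly like the following in PHP. This is a sketch, not something posted in the thread: the sample string and the pattern are assumptions made for illustration, and the pattern only treats `&name;`, `&#38;` and `&#x26;` shapes as existing references. It does nothing for other special characters, and a named entity that XML does not define (for example `&nbsp;`) would still be rejected by the parser.

```php
<?php
// Hypothetical sample: one ampersand is already escaped, one is stray.
$xml = '<root><node>fish &amp; chips</node><node>bad node containing &</node></root>';

// Match '&' only when it is NOT followed by something that looks like a
// named (&name;), decimal (&#38;) or hexadecimal (&#x26;) reference.
$pattern = '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';

// Replace each stray ampersand with the predefined &amp; entity.
$fixed = preg_replace($pattern, '&amp;', $xml);

echo $fixed, "\n";
```

With this input the already escaped `&amp;` is left alone and only the bare `&` in the second node is rewritten, so text that mixes entities and stray ampersands keeps its meaning after parsing.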

1 Answer


If you can indeed assume that there will be no < or > characters inside the nodes you want to CDATA-ize, then this should work just fine for your situation:

>(?=[^<&]*&)([^<]*)<

replacing with

><![CDATA[\1]]><

This expression only matches nodes whose text contains an & character (whether or not that & is part of an entity such as &amp;), then wraps the contents of those nodes in a CDATA section. If you need to ignore & characters inside entities, that's considerably tougher, but I'd be willing to give it a look.
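
For reference, here is a minimal PHP sketch of how the expression could be applied before parsing. It uses the sample XML from the question (with the last tag written as `</root>`), and the variable names are only illustrative. Because the match consumes the surrounding `>` and `<`, the replacement string puts both back around the CDATA section; `$1` is the PHP spelling of the `\1` backreference.

```php
<?php
// Sample input from the question, with the closing tag written as </root>.
$xml = <<<XML
<root>
    <node>good node</node>
    <node>bad node containing &</node>
</root>
XML;

// Find text between '>' and '<' that contains at least one '&'.
// Assumes, as the question does, that node text never contains '<' or '>'.
$pattern = '/>(?=[^<&]*&)([^<]*)</';

// Restore the consumed '>' and '<' and wrap the captured text in CDATA.
$fixed = preg_replace($pattern, '><![CDATA[$1]]><', $xml);

$root = simplexml_load_string($fixed);
foreach ($root->node as $node) {
    echo (string) $node, "\n";
}
```

Only the second node is rewritten, to `<node><![CDATA[bad node containing &]]></node>`, and the loop prints the text of both nodes. Note that any entity already present in a wrapped node (say `&amp;`) would come out as literal text, which is the limitation described above.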

Code Jockey
  • [A portal to another world will open and from it horrors whose names could not be spoken shall spew forth](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). But a pretty nice solution nonetheless. – G_H Nov 17 '11 at 15:54
  • it might work - mostly - kinda - with a few exceptions I haven't identified yet... :D – Code Jockey Nov 17 '11 at 16:01
  • Hi! Thanks for your help! I modified it a little and it worked with a `+` instead of the first `*` because it was matching `[the blanks or nothing right here]`. It's not the perfect solution (as @G_H pointed out by providing "the proof"), but it worked for my particular case. I will kindly ask those guys to give me better, valid XML. – s3v3n Nov 17 '11 at 16:13
  • @s3v3n I never say never... No matter how much people scream that something is an anti-pattern or the worst possible solution, you can almost always come up with some case where it's actually right. If this works for you, say, 99.9% of the time and that's good enough, why not? But you should certainly demand that valid XML is provided. – G_H Nov 17 '11 at 16:16