0

I need to process an XML file that is not well-formed.

I need to wrap the content of some tags in <![CDATA[ ... ]]>. I did something like this:

$pattern = "/<$tagname?>(.*)?<\/$tagname>/"; 
$replacement = "<$tagname><![CDATA[$1]]></$tagname>";

$xml = file_get_contents($inputFilename);
preg_match($pattern, $xml, $match);
echo "\nFirst Occurrence: " . $match[0];

$modifiedXml = preg_replace($pattern, $replacement, $xml);
preg_match($pattern, $modifiedXml, $match);

echo "\nFirst Occurrence Modified: " . $match[0];

It works well, but not when my XML node contains newlines, for example:

<node> foo
bar
</node>

Then it doesn't work. I've read that I have to put /s, but I don't have any idea where I have to put it in my regex.

Alan Moore
iVela

4 Answers

1

I don't have any idea where I have to put it in my regex.

Here

$pattern = "/<$tagname?>(.*)?<\/$tagname>/s";

P.S. The dot (.) matches any character except a newline. The s modifier tells it to match newlines too.
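
To see the difference, here is a minimal sketch. The sample string and the hard-coded tag name "node" are just assumptions taken from the example in the question, and the stray question marks are left out:

```php
<?php
// Sample input modelled on the question's <node> example.
$xml = "<node> foo\nbar\n</node>";

// Without /s the dot stops at the first newline, so the match fails.
$without = preg_match('/<node>(.*)<\/node>/', $xml, $m1);

// With /s after the closing delimiter, the dot also matches newlines.
$with = preg_match('/<node>(.*)<\/node>/s', $xml, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```

The modifier goes after the final delimiter, in the same place as i or m would.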

Cheery
  • You might want to take a closer look at the placement of that question mark. ;) I know it's like that in the question, but you aren't helping anyone by repeating the error. – Alan Moore Feb 05 '12 at 01:49
  • @AlanMoore I just repeated his regexp as it was working before and OP did not have problems with it. To be correct the question mark at `$tagname?` also is weird. – Cheery Feb 05 '12 at 02:07
  • Doh! I didn't even notice that one. :-/ It looks like it's trying to make the element's name optional in the opening tag while still being mandatory in the closing tag. That's another error I would have fixed (or at least commented on) instead of blindly copying it. – Alan Moore Feb 05 '12 at 08:09
0
$pattern = "/<$tagname>([^\\0]*)?<\/$tagname>/"; 
Marina
0

Just from the looks of it, one thing you can do is replace:

(.*)?

by:

((.|\s)*)?

Of course that question mark is pretty useless (it was so in your sample as well), so you can change that into:

((\s|.)*)

edit: I would like to add that I don't think that this is a neat solution, but one requiring very little change from your starting code.

On another note, this regex has some problems when it comes to xml in general. Realize that it only works properly if there is no more than one "tagname"-tag in the document.
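
A small sketch of that limitation, assuming a made-up document with two "node" elements. It uses the s modifier rather than the alternation so the effect is easy to see; the greedy quantifier is what causes it:

```php
<?php
// Hypothetical input with two <node> elements.
$xml = "<node>a</node><other>x</other><node>b</node>";

// Greedy: runs from the first opening tag to the LAST closing tag.
preg_match('/<node>(.*)<\/node>/s', $xml, $greedy);

// Non-greedy: stops at the first closing tag.
preg_match('/<node>(.*?)<\/node>/s', $xml, $lazy);

echo $greedy[1], "\n"; // a</node><other>x</other><node>b
echo $lazy[1], "\n";   // a
```

Even the non-greedy version still breaks on nested tags of the same name, so this only narrows the problem.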

Jasper
  • Read the question more carefully. He is talking about regexp modifier, not about the representation of the space symbol. – Cheery Feb 05 '12 at 00:10
  • Ah yes, you are right, I mistook the slash for a backslash :S I am leaving this here as it is another way to solve the problem, even if it isn't the way he had been reading about. – Jasper Feb 05 '12 at 00:15
  • Of course if one is not wanting to use `\s` just replacing it with `\n` would accomplish the same thing in a slightly nicer way. – Jasper Feb 05 '12 at 00:17
  • Also, if one were to take that approach, it should be ([.\s]*?) -- that is, using a character class, with non-greedy capture. – Umbrella Feb 05 '12 at 00:30
  • @Umbrella Character class works, but so does my way of doing it. I just tend not to use shorthand character classes inside character classes. – Jasper Feb 05 '12 at 00:42
  • @Umbrella Non-greedy capturing makes the regex better if you don't have nested "tagname"-tags, but in the end you are still limiting yourself to certain cases in xml files, which is why I stuck closer to the original code, which also has that problem. It's a valid remark about the original regex though, as long as you add when it now does and does not work. – Jasper Feb 05 '12 at 00:46
  • yeah, (.|\s)* will match, but it will return every character as its own capture in the $match array instead of the whole string as one capture. – Umbrella Feb 05 '12 at 00:48
  • Sorry Jasper, but that's still not a valid solution. See my answer for details. – Alan Moore Feb 05 '12 at 01:45
0

First, (.*)? is incorrect. It means "zero or more of any characters, zero or one time," which makes no sense. You obviously meant (.*?), which means "zero or more of any characters, non-greedily."
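
Put together with the fix for newlines, the replacement from the question might look like the sketch below. The sample input and the use of preg_quote() are my additions, not something from the question, and it assumes the tag content never contains ]]> itself:

```php
<?php
$tagname = 'node'; // hypothetical tag name, as in the question's example
$xml = "<node> foo\nbar\n</node><node>baz</node>"; // made-up sample input

// Escape the tag name in case it contains regex metacharacters.
$tag = preg_quote($tagname, '/');

// (.*?) is non-greedy; /s lets the dot match newlines.
$pattern = "/<$tag>(.*?)<\/$tag>/s";
$replacement = "<$tagname><![CDATA[\$1]]></$tagname>";

$modifiedXml = preg_replace($pattern, $replacement, $xml);
echo $modifiedXml;
```

With the non-greedy group, each element is wrapped separately instead of everything between the first opening tag and the last closing tag being treated as one match.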

The reason it's not matching newlines is (as @Cheery explained) because that's the normal, default behavior. If you want the dot to match anything including newlines, you have to specify single-line mode (also known as DOTALL mode). In PHP you typically do that by adding the /s flag to the end of the regex (e.g. '/(.*?)/s') or by inserting the inline modifier (?s) at the beginning of the regex (e.g. '/(?s)(.*?)/').
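
A quick sketch showing that the two spellings behave the same; the test string is made up:

```php
<?php
$text = "a\nb"; // made-up input with a newline between a and b

// Trailing /s flag.
$flag = preg_match('/a(.*?)b/s', $text, $m1);

// Inline (?s) modifier at the start of the pattern.
$inline = preg_match('/(?s)a(.*?)b/', $text, $m2);

var_dump($flag, $inline);   // int(1) int(1)
var_dump($m1[1] === "\n");  // bool(true)
```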

There are other valid techniques, too. For example, in JavaScript, which had no single-line/DOTALL mode until the s flag was added in ES2018, regex authors traditionally use [\s\S], meaning "any whitespace character or any character that's not whitespace"--in other words, any character.
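
The same trick works in PHP without any modifier. A minimal sketch, again assuming the "node" sample from the question:

```php
<?php
$xml = "<node> foo\nbar\n</node>";

// [\s\S] matches any character at all, newlines included, with no /s flag.
$ok = preg_match('/<node>([\s\S]*?)<\/node>/', $xml, $m);

var_dump($ok);                     // int(1)
var_dump($m[1] === " foo\nbar\n"); // bool(true)
```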

Often you don't even need to worry about it. For example, in a case like yours you might know that there are no other tags between the pair you're matching, so you could use [^<] to match any character except <, because that does include newlines. (But if the XML is malformed as you say, that might not be an option.)
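
A sketch of that approach and of where it stops working; both input strings are made up for illustration:

```php
<?php
// Plain text content: [^<]* happily crosses the newlines.
$plain = "<node> foo\nbar\n</node>";
$okPlain = preg_match('/<node>([^<]*)<\/node>/', $plain, $m);

// Content with a nested tag: [^<]* stops at the "<" of <b>, so no match.
$nested = "<node>a <b>x</b></node>";
$okNested = preg_match('/<node>([^<]*)<\/node>/', $nested);

var_dump($okPlain);  // int(1)
var_dump($okNested); // int(0)
```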

What you should not use is (.|\s), which was suggested in another answer. As explained very capably in this answer, this innocent-seeming regex can very easily slow the regex engine to a virtual halt due to the overlap in character sets matched by . and \s.

Another "obvious" approach that I often see recommended is (.|\n), but that isn't safe either. When we say the dot doesn't match newlines, that doesn't just mean the linefeed character (\n, U+000A). Depending on the regex flavor, the compile-time configuration, and the run-time system settings, it can also include carriage-return (\r, U+000D), form-feed (\f, U+000C), and several other characters (ref). (.|\n) is also significantly less efficient than the other options, though probably not disastrously so like (.|\s).

Community
Alan Moore