First, (.*)? is incorrect. It means "zero or more of any characters, zero or one time," which makes no sense. You obviously meant (.*?), which means "zero or more of any characters, non-greedily."
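To make the difference concrete, here is a minimal sketch. The sample string and the <b> tags are made up for illustration and are not taken from the question.

```php
<?php
// Hypothetical input with two tag pairs on the same line.
$subject = '<b>one</b> and <b>two</b>';

// (.*)? is a greedy .* inside an optional group, so it still grabs as much as it can.
preg_match('/<b>(.*)?<\/b>/', $subject, $greedy);

// (.*?) is a lazy quantifier, so it stops at the first closing tag.
preg_match('/<b>(.*?)<\/b>/', $subject, $lazy);

echo $greedy[1], "\n"; // one</b> and <b>two
echo $lazy[1], "\n";   // one
```

The optional group changes nothing here because .* can already match the empty string.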
The reason it's not matching newlines is (as @Cheery explained) that this is the normal, default behavior. If you want the dot to match anything including newlines, you have to specify single-line mode (also known as DOTALL mode). In PHP you typically do that by adding the /s flag to the end of the regex (e.g. '/(.*?)/s') or by inserting the inline modifier (?s) at the beginning of the regex (e.g. '/(?s)(.*?)/').
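Both spellings can be checked side by side. This is only a sketch: the <note> tag and the two-line content are hypothetical stand-ins for whatever you are actually matching.

```php
<?php
// Hypothetical input: the tag content spans two lines.
$subject = "<note>first line\nsecond line</note>";

// Default mode: the dot stops at the linefeed, so the pattern cannot match.
$default = preg_match('/<note>(.*?)<\/note>/', $subject);

// Single-line (DOTALL) mode via the trailing /s flag.
$flag = preg_match('/<note>(.*?)<\/note>/s', $subject, $m1);

// The same mode via the inline (?s) modifier.
$inline = preg_match('/(?s)<note>(.*?)<\/note>/', $subject, $m2);

var_dump($default);         // int(0)
var_dump($flag, $inline);   // int(1) int(1)
var_dump($m1[1] === $m2[1]); // bool(true)
```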
There are other valid techniques, too. For example, in JavaScript, which had no single-line/DOTALL mode until the s flag was added in ES2018, most regex authors use [\s\S], meaning "any whitespace character or any character that's not whitespace" (in other words, any character).
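The same character class works in PCRE, so it can be sketched in PHP as well. The input is again a made-up two-line example, and no /s flag is used.

```php
<?php
// Hypothetical input: the tag content spans two lines.
$subject = "<note>first line\nsecond line</note>";

// [\s\S] matches any character, newlines included, regardless of the /s flag.
$found = preg_match('/<note>([\s\S]*?)<\/note>/', $subject, $m);

var_dump($found); // int(1)
var_dump($m[1]);  // the full two-line content
```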
Often you don't even need to worry about it. For example, in a case like yours you might know that there are no other tags between the pair you're matching, so you could use [^<] to match any character except <, because that does include newlines. (But if the XML is malformed as you say, that might not be an option.)
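A short sketch of the negated-class approach, including the case where it stops working. Both inputs and the <note> tag are hypothetical.

```php
<?php
// Hypothetical input with no nested tags: [^<] happily crosses the newline.
$plain = "<note>first line\nsecond line</note>";
$okPlain = preg_match('/<note>([^<]*)<\/note>/', $plain, $m);

// Hypothetical input with a nested tag: [^<]* stops at <b>, so the match fails.
$nested = "<note>first <b>bold</b>\nline</note>";
$okNested = preg_match('/<note>([^<]*)<\/note>/', $nested);

var_dump($okPlain);  // int(1)
var_dump($okNested); // int(0)
```

Because [^<]* can never run past the next <, it also needs no lazy quantifier.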
What you should not use is (.|\s), which was suggested in another answer. As explained very capably in this answer, this innocent-seeming regex can very easily slow the regex engine to a virtual halt due to the overlap in character sets matched by . and \s.
Another "obvious" approach that I often see recommended is (.|\n), but that isn't safe either. When we say the dot doesn't match newlines, that doesn't just mean the linefeed character (\n, U+000A). Depending on the regex flavor, the compile-time configuration, and the run-time system settings, it can also include carriage-return (\r, U+000D), form-feed (\f, U+000C), and several other characters (ref). (.|\n) is also significantly less efficient than the other options, though probably not disastrously so like (.|\s).
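One way to see the problem in PHP is to force a different newline convention with PCRE's (*ANYCRLF) verb. That verb is my choice for making the behavior reproducible and is not something the question used. Under that setting the dot rejects both CR and LF, so (.|\n) has nothing left that can match a lone carriage return.

```php
<?php
// Hypothetical input: a lone carriage return between two letters.
$subject = "a\rb";

// With (*ANYCRLF), the dot rejects \r and the \n branch does not cover it either.
$alternation = preg_match('/(*ANYCRLF)a(.|\n)b/', $subject);

// Single-line mode makes the dot match every character, whatever the convention.
$dotall = preg_match('/(*ANYCRLF)a.b/s', $subject);

var_dump($alternation); // int(0)
var_dump($dotall);      // int(1)
```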