I need a single regular expression for finding
- double spaces
- tabs
- extra enter (line-break/carriage return)
- line break between tag
- space after/before closing tag
in an XML document.
I am sorry this isn't going to be much of an answer, but perhaps I can help a little bit. Items 1, 2, and 3 (double spaces, tabs, and extra line breaks) are not too difficult to match:
a tab is \t
new lines are \r or \n
white space (a space, tab, or new-line) is \s
so one tab is (\t) and two spaces are ( {2}); note that (\s\s) or (\s{2}) would match any two whitespace characters, including tabs and line breaks, not just spaces. An extra line is generally found by two line breaks in a row, but sometimes they are separated by whitespace, so watch out for that: (\r\s*\r)|(\n\s*\n)
To put it all together, items 1, 2, and 3 with capturing groups:
(\r\s*\r)|(\n\s*\n)|(\t)|( {2})
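As a rough illustration of how the alternation for items 1 to 3 behaves, here is a minimal Python sketch. Python's `re` module is only one dialect, so treat this as an assumption about your environment rather than a drop-in solution. The names `WHITESPACE_ISSUES`, `LABELS`, and `find_issues` are made up for this example, and the double-space branch uses a literal space rather than `\s`, since `\s` would also match tabs and line breaks.

```python
import re

# One capturing group per kind of problem, in the same order as the answer.
# Assumption: a literal space is used for the double-space branch.
WHITESPACE_ISSUES = re.compile(r"(\r\s*\r)|(\n\s*\n)|(\t)|( {2})")

# Label for each capturing group (groups 1 and 2 both mean "extra line break").
LABELS = ("extra line break", "extra line break", "tab", "double space")


def find_issues(text):
    """Return (label, offset) pairs for every match, in document order."""
    return [
        (LABELS[m.lastindex - 1], m.start())
        for m in WHITESPACE_ISSUES.finditer(text)
    ]


sample = "<a>\n\n<b>x  y</b>\t</a>"
for label, offset in find_issues(sample):
    print(f"{label} at offset {offset}")
```

Because matches cannot overlap, a run of three spaces is reported as one double space, and a tab sitting between two line breaks is swallowed by the line-break branch.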
But what about 4 and 5?
Well, they both require being able to backtrack in a regular expression in the event that a < does not have a corresponding >, or is not part of the document structure. This can happen with invalid XML, in CDATA sections, and the like. It gets complicated, but it can be done with recursive regular expressions. However, I don't know of a regex library in C++ that supports recursion. I'm sorry, but it would probably be much easier to just parse your string data by hand.
Regular expression dialects differ, so your mileage may vary. For reference, I tend to use http://www.regular-expressions.info/quickstart.html
In general you need an XML parser to process XML documents. Regular expressions are not powerful enough to handle all cases.
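To make the parser suggestion a little more concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the document is well-formed and small enough to load into memory. The helper name `whitespace_issues` and the pattern `ISSUE` are invented for this example. The parser hands back the character data as `text` (content before the first child) and `tail` (content after the closing tag), so the regex only ever sees real text, including CDATA content, and never markup.

```python
import re
import xml.etree.ElementTree as ET

# Double space, tab, or two line breaks separated only by whitespace.
# The parser normalizes line endings to \n, so \r is not handled here.
ISSUE = re.compile(r" {2}|\t|\n\s*\n")


def whitespace_issues(xml_text):
    """Return (tag, field) pairs for text nodes with suspicious whitespace."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        for field in ("text", "tail"):
            value = getattr(elem, field)
            if value and ISSUE.search(value):
                found.append((elem.tag, field))
    return found


doc = "<root><a>x  y</a>\t<b>ok</b></root>"
print(whitespace_issues(doc))
```

One limitation of this approach: whitespace inside a tag itself (for example a line break between attributes) is discarded by the parser, so it cannot be reported this way.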
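Using Perl syntax for regexes. Line breaks are written as \n here, because Perl reads a dollar sign followed by a backslash inside a pattern as the output record separator variable rather than as an end-of-line anchor:
m{
[ ][ ] # double spaces
|
\t # tab
|
\n\s*\n # extra enter: two line breaks separated only by whitespace
|
# XXX: it works only on simple xml
<[^<>]*\n[^<>]*> # line break inside tag
|
# XXX: it works only on simple xml
[ ]</[^<>]+> # space before closing tag
|
</[^<>]+>[ ] # space after closing tag
}xg;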
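For readers who are not using Perl, the same idea can be sketched in Python with `re.VERBOSE`, with line breaks written as `\n`. This is an approximate port, not a tested equivalent of the Perl pattern, and the name `XML_WHITESPACE` is made up for the example. The same caveat applies: the tag branches only work on simple XML.

```python
import re

XML_WHITESPACE = re.compile(
    r"""
      [ ][ ]              # double spaces
    | \t                  # tab
    | \n\s*\n             # extra line break, separated only by whitespace
    | <[^<>]*\n[^<>]*>    # line break inside a tag (simple XML only)
    | [ ]</[^<>]+>        # space before a closing tag (simple XML only)
    | </[^<>]+>[ ]        # space after a closing tag (simple XML only)
    """,
    re.VERBOSE,
)

sample = '<a\nid="1">text </a>'
matches = [m.group(0) for m in XML_WHITESPACE.finditer(sample)]
print(matches)
```

Matches never overlap, so a closing tag with a space on both sides is reported once, by the "space before" branch; the trailing space is not flagged separately.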