0

I need a single regular expression that finds the following in an XML document:

  1. double spaces
  2. tabs
  3. extra line breaks (line feed/carriage return)
  4. line breaks between tags
  5. spaces before/after a closing tag

Samuel Liew

2 Answers

0

I am sorry this isn't going to be much of an answer, but perhaps I can help a little. Items 1, 2, and 3 are not too difficult to match:

a tab is \t

new lines are \r or \n

white space (a space, tab, or new-line) is \s

So one tab is (\t), and two spaces are ( {2}) with a literal space; (\s\s) or (\s{2}) would also match pairs of tabs or line breaks. An extra line is generally two line breaks in a row, but they can be separated by whitespace, so watch out for that: (\r\s*\r)|(\n\s*\n)

Putting it all together, items 1, 2, and 3 with capturing groups are:

  (\r\s*\r)|(\n\s*\n)|(\t)|( {2})
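Since the question is about C++, here is a minimal sketch of how a pattern like this could be applied with `std::regex` to log which kind of issue each match is. The function name `find_whitespace_issues` and the issue labels are made up for illustration. The sketch uses a literal-space `( {2})` for double spaces, because `\s{2}` would also match two tabs or two line breaks, and it merges the two line-break alternatives into one group.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the kind of each whitespace issue, in document order.
std::vector<std::string> find_whitespace_issues(const std::string& text) {
    // Group 1: extra line break (possibly with whitespace in between),
    // group 2: tab, group 3: two literal spaces.
    static const std::regex pattern(R"((\r\s*\r|\n\s*\n)|(\t)|( {2}))");
    std::vector<std::string> kinds;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched) {
            kinds.push_back("extra line break");
        } else if (m[2].matched) {
            kinds.push_back("tab");
        } else {
            kinds.push_back("double space");
        }
    }
    return kinds;
}
```

The line-break alternative is listed first so that a blank line containing only spaces is reported as an extra line break rather than as a double space. `m.position()` could be logged as well if offsets are needed.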

But what about 4 and 5?

Well, they both require being able to backtrack in a regular expression in the event that a < does not have a corresponding >, or is not part of the document structure. This can happen with invalid XML, in CDATA sections, and so on. It gets complicated, but it can be done with recursive regular expressions. However, I don't know of a regex library in C++ that supports recursion. I'm sorry, but it would probably be much easier to parse your string data by hand.
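As a rough illustration of what parsing by hand could look like for item 4, the sketch below walks the string once and records line breaks that fall between a `<` and the next `>`. The function name `line_breaks_inside_tags` is hypothetical, and the sketch assumes simple XML: it does not handle comments, CDATA sections, or `<`/`>` inside attribute values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the offsets of line breaks found inside a tag,
// i.e. between a '<' and the next '>'.
// Assumption: no comments, CDATA sections, or '<'/'>' inside attribute values.
std::vector<std::size_t> line_breaks_inside_tags(const std::string& xml) {
    std::vector<std::size_t> offsets;
    bool in_tag = false;
    for (std::size_t i = 0; i < xml.size(); ++i) {
        const char c = xml[i];
        if (c == '<') {
            in_tag = true;
        } else if (c == '>') {
            in_tag = false;
        } else if (in_tag && (c == '\n' || c == '\r')) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

Note that a Windows-style `\r\n` inside a tag is reported as two offsets here; a real log would probably want to collapse them into one entry.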

Regular expression dialects differ, so your mileage may vary. For reference, I tend to use http://www.regular-expressions.info/quickstart.html

Jason
  • Thanks for helping, but it is only finding tabs. I am creating a log for XML documents that will have all of the above errors. – Chetan Patil Nov 11 '11 at 13:50
  • For double spaces and tabs I am using "( )|\t"; for the others I don't have any idea. An extra enter means finding \n after \n, so can you tell me how to do that? – Chetan Patil Nov 11 '11 at 13:53
0

In general you need an XML parser to process XML documents. Regular expressions are not powerful enough to handle all cases.

Using Perl syntax for regexes:

m{
  [ ][ ]             # double spaces
  |
  \t                 # tab
  |
  $\s+$              # extra line break, separated only by whitespace. Note: requires the `m` flag
                     # (\s+ rather than \s*, which would match the empty string at every line end)
  |
  # XXX: it works only on simple XML
  <[^<>]*$[^<>]*>    # line break inside a tag
  |
  # XXX: it works only on simple XML
  [ ]</[^<>]+>       # space before a closing tag
  |
  </[^<>]+>[ ]       # space after a closing tag
}mxg;

demo
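For C++, the two tag-related alternatives might translate to `std::regex` roughly as follows. This is a sketch, not a tested port of the Perl expression: the function names are made up, `\n` is used in place of `$` so that no multiline flag is needed, and the same "simple XML only" caveat applies (no `<` or `>` in attribute values or text, no comments, no CDATA).

```cpp
#include <regex>
#include <string>

// Hypothetical helpers; both patterns assume simple XML, as noted above.

// True if some tag contains a line feed between its '<' and '>'.
bool has_line_break_inside_tag(const std::string& xml) {
    static const std::regex pattern(R"(<[^<>]*\n[^<>]*>)");
    return std::regex_search(xml, pattern);
}

// True if a closing tag is directly preceded or followed by a space.
bool has_space_around_closing_tag(const std::string& xml) {
    static const std::regex pattern(R"( </[^<>]+>|</[^<>]+> )");
    return std::regex_search(xml, pattern);
}
```

The negated class `[^<>]` also matches line breaks, which is why the first pattern can span several lines without any extra flags.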

Community
jfs
  • i am working in c++ in visual studio.suppose is a tag having line break in it.will <[^<>]*\s+$[^<>]*> work for that tag.. ? – Chetan Patil Nov 12 '11 at 06:07
  • @Chetan Patil: [you can try it](http://regexr.com?2v65g). It matches if you remove `\s+`. I've edited the answer. – jfs Nov 12 '11 at 06:13
  • thanx for your help. but i want to find \n inside tag and also between opening and closing tag like \n. – Chetan Patil Nov 12 '11 at 07:58
  • @Chetan Patil: here's a [regex to find \n inside tag and also between opening and closing tag](http://regexr.com?2v66b). It works on a *small* subset of xml documents (those that [don't have `<>` in attributes](http://stackoverflow.com/q/94528/) or `<` in the text). And I don't even mention comments, CDATA. – jfs Nov 12 '11 at 08:27
  • Thank you very much for your help. Now one last question :D What is $, and can I put \n in place of $? – Chetan Patil Nov 12 '11 at 10:32
  • [`$` Match the end of the line](http://perldoc.perl.org/perlre.html) (or before newline at the end) ... *Embedded newlines will not be matched by "^" or "$"* if there is no `m` flag. Here's [an example](http://ideone.com/vjB3X). – jfs Nov 12 '11 at 10:59