Java RegEx: What's the difference between .* and \s\S?

Question

I have read in books and on the web that .\n is usually equivalent to \s\S, \d\D, or \w\W, meaning that it matches any character. But now that I want to extract a message from a string, I find that only .\n works. What's wrong with my code? Why can't I use the \s\S expression?

String srcMsg="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root><resultCode>00000</resultCode><resultDesc><![CDATA[00000:<ResponseClass Name=\"Response\">\n    <ResponseSubClass Name=\"attributesResponse\">\n         <ITEM>0</ITEM>\n </ResponseSubClass>\n</ResponseClass>]]></resultDesc></root>";
//The right code 
java.util.regex.Pattern pP0 = java.util.regex.Pattern.compile(".*<!\\[CDATA\\[00000:((.|\n)*)\\]\\]>.*"); 
//wrong code1 
//java.util.regex.Pattern pP0 = java.util.regex.Pattern.compile(".*<!\\[CDATA\\[00000:(\\s|\\S)*\\]\\]>.*");
//wrong code2 
//java.util.regex.Pattern pP0 = java.util.regex.Pattern.compile(".*<!\\[CDATA\\[00000:[\\w|\\W]*\\]\\]>.*");
java.util.regex.Matcher mP0 = pP0.matcher(srcMsg);
if (mP0.find()) {
    String para = mP0.group(1);
    int dsi3 = para.indexOf("<ITEM>") + "<ITEM>".length();
    int dsi4 = para.indexOf("</ITEM>");
    System.out.println(Integer.valueOf(para.substring(dsi3, dsi4)));
}

`.` dot matches all but newline. `[\S\s]` is a class that has all of one thing and all of the things that are not the thing, result is it matches all characters. — , May 05 '19 at 16:52
You've been pointed to this before... Why do you ignore it? https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — OneCricketeer, May 05 '19 at 16:54
You REALLY don't want to attempt general parsing of X(HT)ML with regular expressions. This is like trying to use a blunt screwdriver to carve intricate hardwood scrollwork. The results will never be acceptable, and the developer who has to clean up your mess will curse your name for eternity. Do yourself a favor and use a real parser. — Jim Garrison, May 05 '19 at 18:02

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

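By default, the . pattern doesn't match line terminators, which is close to what \R matches (the difference is that . still matches \u000B and \u000C, which the Pattern javadoc does not list as line terminators):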

Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

A [] character class that combines two complementary predefined character classes will match every character, e.g. [\d\D], [\h\H], [\s\S], [\v\V], [\w\W], [\p{L}\P{L}], etc.
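To make the difference concrete, here is a minimal sketch that checks a few patterns against multi-line input. The class name DotVersusComplementClass and the helper matchesAll are made up for illustration; only java.util.regex.Pattern comes from the JDK. It also suggests that (.|\n) is not a full substitute for [\s\S], because neither alternative matches a carriage return.

```java
import java.util.regex.Pattern;

class DotVersusComplementClass {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String unixLines = "a\nb";
        String windowsLines = "a\r\nb";

        // The dot stops at the line terminator, so the whole input cannot match.
        System.out.println(matchesAll(".*", unixLines));            // false
        // A class plus its complement matches every character.
        System.out.println(matchesAll("[\\s\\S]*", unixLines));     // true
        // Dot-or-newline copes with \n ...
        System.out.println(matchesAll("(?:.|\\n)*", unixLines));    // true
        // ... but not with \r, which neither alternative matches.
        System.out.println(matchesAll("(?:.|\\n)*", windowsLines)); // false
        System.out.println(matchesAll("[\\s\\S]*", windowsLines));  // true
    }
}
```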

The . pattern can be changed to match all characters by setting the DOTALL flag, in one of these ways:

// Set flag external from pattern
Pattern.compile(".", Pattern.DOTALL)

// Set flag in the pattern
Pattern.compile("(?s).")

// Set flag in part of pattern
Pattern.compile("(?s:.)")

For your convenience, here is the javadoc of the DOTALL flag:

Enables dotall mode.

In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.

Dotall mode can also be enabled via the embedded flag expression (?s). (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)
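As a quick check of the three ways to enable DOTALL, the sketch below runs each variant against a two-line string. The class name DotallDemo and the helper matchesTwoLines are hypothetical; the flag and the embedded syntax are the ones quoted from the javadoc. The last case is there to show that the scoped form (?s:...) only affects the dot inside its own group.

```java
import java.util.regex.Pattern;

class DotallDemo {
    static final String TWO_LINES = "a\nb";

    // Hypothetical helper: true if the pattern matches the whole two-line input.
    static boolean matchesTwoLines(Pattern pattern) {
        return pattern.matcher(TWO_LINES).matches();
    }

    public static void main(String[] args) {
        // Default mode: the dot does not cross the newline.
        System.out.println(matchesTwoLines(Pattern.compile(".*")));                 // false
        // Flag passed to compile().
        System.out.println(matchesTwoLines(Pattern.compile(".*", Pattern.DOTALL))); // true
        // Flag embedded for the rest of the pattern.
        System.out.println(matchesTwoLines(Pattern.compile("(?s).*")));             // true
        // Flag scoped to one group.
        System.out.println(matchesTwoLines(Pattern.compile("(?s:.*)")));            // true
        // The dot outside the scoped group is still in default mode.
        System.out.println(matchesTwoLines(Pattern.compile("(?s:a).b")));           // false
    }
}
```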

score 1 · Accepted Answer · 2019-05-05T17:34:31.290

The . (dot) matches every character except line terminators.
[\S\s] is a character class that contains a set together with
its complement, so it matches every character.

The code that follows the regex reads group 1, so I believe
the other two regexes need an equivalent group 1. Here they are:

1) https://regex101.com/r/Tp1k9m/1

 .* <!\[CDATA\[00000:
 (                             # (1 start)
      (?: . | \n )*            #    Should be *?
 )                             # (1 end)
 \]\]> .*

2) https://regex101.com/r/FdoHGl/1

 .* <!\[CDATA\[00000:
 (                             # (1 start)
      (?: \s | \S )*           #    Should be *?
 )                             # (1 end)
 \]\]> .*

3) https://regex101.com/r/t3vVcB/1

 .* <!\[CDATA\[00000:
 (                             # (1 start)
      [\w\W]*                  #    Was [\w|\W], fixed it.
                               #    Should be *?
 )                             # (1 end)
 \]\]> .*
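The effect of where the capturing group sits can be checked with a small sketch. The input string, the class name GroupPlacementDemo, and the helpers firstGroup and groupCount are invented for this illustration. It suggests why the two commented-out patterns in the question misbehave: with the quantifier outside the parentheses, group 1 keeps only the last character of the repetition, and a pattern without parentheses has no group 1 at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPlacementDemo {
    // Hypothetical input, shortened from the question's message.
    static final String INPUT = "<![CDATA[00000:line1\nline2]]>";

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(1) : null;
    }

    // Number of capturing groups declared in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher(INPUT).groupCount();
    }

    public static void main(String[] args) {
        // Quantifier outside the group: only the last repetition is kept.
        System.out.println(firstGroup("<!\\[CDATA\\[00000:(\\s|\\S)*\\]\\]>"));
        // Quantifier inside the group: the whole payload is captured.
        System.out.println(firstGroup("<!\\[CDATA\\[00000:((?:\\s|\\S)*)\\]\\]>"));
        System.out.println(firstGroup("<!\\[CDATA\\[00000:([\\w\\W]*)\\]\\]>"));
        // No parentheses: zero groups, so group(1) would throw.
        System.out.println(groupCount("<!\\[CDATA\\[00000:[\\w|\\W]*\\]\\]>"));
    }
}
```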

Note that in character classes, there is an implicit OR
between items. So, you don't have to include an or symbol
in there unless you want to match a literal |
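A short sketch of that point, using the static Pattern.matches call from the JDK; the class name CharClassPipeDemo and the helper matchesPipe are made up for the example. Inside brackets the | is just another member of the class, so [a|b] accepts a literal pipe while [ab] does not. In [\w|\W] the extra | happens to be harmless, since \W already covers it.

```java
import java.util.regex.Pattern;

class CharClassPipeDemo {
    // Hypothetical helper: does the single-character class accept a literal "|"?
    static boolean matchesPipe(String characterClass) {
        return Pattern.matches(characterClass, "|");
    }

    public static void main(String[] args) {
        // No pipe listed: "|" is not in the class.
        System.out.println(matchesPipe("[ab]"));       // false
        // Pipe listed: it is matched literally, not treated as alternation.
        System.out.println(matchesPipe("[a|b]"));      // true
        // \W already includes "|", so the extra "|" changes nothing here.
        System.out.println(matchesPipe("[\\w\\W]"));   // true
        System.out.println(matchesPipe("[\\w|\\W]"));  // true
    }
}
```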

Also, a note on using greedy quantifiers in these regexes.
A greedy quantifier goes immediately to the end of the string
and backtracks until it finds a match, so it skips past every
earlier closing delimiter ( in this case \]\]> ) and stops at the last one.
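The overshoot is easiest to see on input with two CDATA sections. In the sketch below the input string, the class name GreedyVersusLazyDemo, and the helper firstGroup are all hypothetical. The greedy form captures everything up to the last \]\]>, while the lazy form (*?) stops at the first one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusLazyDemo {
    // Hypothetical input with two CDATA sections on separate lines.
    static final String INPUT =
            "<a><![CDATA[00000:first]]></a>\n<b><![CDATA[00000:second]]></b>";

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: runs to the end of the input, then backtracks to the last "]]>".
        System.out.println(firstGroup("<!\\[CDATA\\[00000:([\\s\\S]*)\\]\\]>"));
        // Lazy: expands one character at a time and stops at the first "]]>".
        System.out.println(firstGroup("<!\\[CDATA\\[00000:([\\s\\S]*?)\\]\\]>"));
    }
}
```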
