6

When we include a shorthand character class and its negated counterpart in the same character class, is the result the same as the dot ., which means any character?

I did a test on regex101.com and every character matched.

Are [\s\S], [\w\W] and [\d\D] the same as . ?

I want to know whether this behavior is consistent across front-end and back-end web languages such as JavaScript, PHP, Python and others.

Rahul
  • 2
    http://www.regular-expressions.info/dot.html – Bergi May 29 '17 at 15:43
  • I wonder what kind of answer is expected here. It sounds too broad as the regex flavor is not indicated. The answer "it depends" does not really help much future visitors. A dot matches rather differently across even Perl originated regex engines, but the `[\s\S]` like constructs also do not act the same way in POSIX and non-POSIX based regex engines. – Wiktor Stribiżew May 29 '17 at 19:14
  • @WiktorStribiżew: Updated my question. – Rahul May 29 '17 at 20:58
  • @WiktorStribiżew - Why couldn't there be a canonical answer that cataloged the major flavors worth documenting? I've seen answers like that before. It would be nice, actually, to have one spot that was updated to find that kind of explanation. – Jared Farrish May 29 '17 at 21:51
  • @JaredFarrish: I am sorry, I do not quite get you. Yes, there could be such an answer. The question sounded too broad in the beginning and it is still rather broad since OP wants to cover major NFA regex engines (and as you see, JS, Python and PHP really differ in what they can do with a dot, though they are unanimous as far as `[\s\S]` is concerned). – Wiktor Stribiżew May 29 '17 at 21:57

3 Answers

6

No, it is not the same. There is an important difference when you are not using the single-line flag: without it, . does not match line breaks, so it does not match everything.

[\s\S] comes in handy when you want to mix both kinds of match in one pattern while . does not match everything.

It is easier to explain with an example. Suppose you want to capture whatever is between a and b: you can use the pattern a(.*?)b (the ? makes the match ungreedy and the parentheses capture the content). If the content may contain newlines and you do not want to capture that case in the same group, you can add another regex such as a([\s\S]*?)b.

Therefore if we create one pattern using both approaches it results in:

a(.*)b|a([\s\S]*?)b
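As a rough illustration of how the two alternatives divide the work, here is a minimal Python sketch. Python's re module and the sample strings are my own choices for the illustration, not part of the regex101 test, and which_group is a hypothetical helper name. Without the single line flag, the first branch can only succeed when a and b sit on the same line, so input with a line break between them falls through to the second branch:

```python
import re

# Combined pattern from above: group 1 uses the dot, group 2 uses [\s\S].
pattern = re.compile(r"a(.*)b|a([\s\S]*?)b")

def which_group(text):
    """Return (group_number, captured_text) for the first match, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # lastindex is the number of the last capturing group that took part
    # in the match, which tells us which alternative succeeded.
    return (m.lastindex, m.group(m.lastindex))

print(which_group("a123b"))     # (1, '123')
print(which_group("a12\n34b"))  # (2, '12\n34')
```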


In this case, if you look at the scenario in regex101, you get a colorful and easy way to tell the two cases apart (capturing group #1 in green and capturing group #2 in red).

So, in conclusion, [\s\S] is a regex trick for when you want to match across multiple lines and . does not suit your needs. It basically depends on your use case.

However, if you use the single-line flag, where . matches newlines, then you don't need the regex trick: in regex101 everything is then green, and group 2 (red above) is not matched.
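The effect of the single line flag can be sketched in Python, where it is called re.DOTALL. This is only an illustration under my own assumptions (the sample text is made up), not a reproduction of the regex101 session:

```python
import re

text = "a12\n34b"
combined = r"a(.*)b|a([\s\S]*?)b"

# Default mode: the dot stops at the line break, so only group 2 can match.
default_match = re.search(combined, text)

# DOTALL (the "single line" / s flag): the dot also matches "\n",
# so the first alternative succeeds and group 2 is never needed.
dotall_match = re.search(combined, text, re.DOTALL)

print(default_match.lastindex)            # 2
print(dotall_match.lastindex)             # 1
print(dotall_match.group(1) == "12\n34")  # True
```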

I have also created a JavaScript performance test, and it shows a performance impact of around 25%:

https://jsperf.com/ss-vs-dot


Federico Piazza
  • When you need to "mix of matches", in PHP, you may use `(?s:.*?)Now, turn the DOTALL mode(?-s:.*)end of line.` There is much more to it, in fact. In Python, you cannot use modifier groups, then, `[\d\D]` is really very handy. In JS, `[\s\S]` is still a workaround, since its native `[^]` does the job. `[\s\S]` is a portable construct across NFA regexps, that is why it is so popular. – Wiktor Stribiżew May 29 '17 at 21:35
  • Hey @WiktorStribiżew thanks for the comment, it's always very cool learning from you – Federico Piazza May 29 '17 at 21:42
1

The answer is: It depends.
If your regex engine matches every character with . then yes, the result is the same. If it doesn't, then no, the result is not the same. In standard JavaScript, for example, . does not match line breaks.
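One way to check this for a given engine is to test each construct against a handful of characters. The following is a minimal sketch for Python's re module only; the sample list and the unmatched helper are hypothetical names chosen for the illustration, and which characters the dot treats as line breaks varies by engine:

```python
import re

# Hypothetical sample: a few ordinary characters plus two line terminators.
sample = ["a", "7", " ", "_", "\t", "\r", "\n"]

def unmatched(pattern, flags=0):
    """Return the sample characters that the pattern fails to match."""
    regex = re.compile(pattern, flags)
    return [ch for ch in sample if regex.fullmatch(ch) is None]

print(unmatched(r"[\s\S]"))        # []
print(unmatched(r"[\w\W]"))        # []
print(unmatched(r"[\d\D]"))        # []
print(unmatched(r"."))             # ['\n']
print(unmatched(r".", re.DOTALL))  # []
```

In this engine the dot only differs from the three character classes on "\n", and the difference disappears once re.DOTALL is set.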

m00hk00h
  • 1
    In most other languages, `.` will not match linefeeds either by default but a flag will be available to change this behaviour. – Aaron May 29 '17 at 16:05
  • *If your regex engine does match every character with `.` then yes, the result is the same.* - `.` matches any char only with POSIX based regex engines, and then `[\s\S]` is parsed as a literal ``\`` and `s` / `S` - so, the answer is *no*. – Wiktor Stribiżew May 29 '17 at 20:22
  • @WiktorStribiżew: In my [test](https://regex101.com/r/SjRUx0/2/) `[\s\S]` is not acting as literal characters. – Rahul May 29 '17 at 21:08
  • 1
    @Rahul Your test is not performed on a POSIX regex. Sure, an NFA regex treats shorthand character classes inside character classes as shorthand character classes. And as far as I see, you mention the NFA regex engines in your updated question now. – Wiktor Stribiżew May 29 '17 at 21:27
0

The "." does not match the newline character, and it does not match it even in Perl multiline matches. So, with a little Perl script like

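#!/usr/bin/perl -w
use strict;
$/="---";              # read records separated by three dashes, not by newlines
my $i=0;               # record counter
my $patA='a[\d\D]b';   # [\d\D] matches any character, including a newline
my $patB='a.b';        # the dot does not match a newline by default
while(<>){
    $i++;
    print "$i: $_";
    print "    patA matches\n" if $_ =~ /$patA/;
    print "    patB matches\n" if $_ =~ /$patB/;
}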

you can pipe some test input to it like this:

$ cat |./aboveskript.pl
a
b

End the input with CTRL-D; to enter multiple records, separate them with three dashes. The output of the above is

1: a
b
    patA matches

So the pattern /a.b/ fails.

smoe