Which characters can be used as regular expression delimiters?

Question

Which characters can be used as delimiters for a Perl regular expression? m/re/, m(re) and måreå all seem to work, but I'd like to know all possibilities.

score 32 · Accepted Answer · answered Apr 24 '11 at 13:01

From perlop:

With the m you can use any pair of non-whitespace characters as delimiters.

So anything goes, except whitespace. The full paragraph for this is:

If "/" is the delimiter then the initial m is optional. With the m you can use any pair of non-whitespace characters as delimiters. This is particularly useful for matching path names that contain "/", to avoid LTS (leaning toothpick syndrome). If "?" is the delimiter, then the match-only-once rule of ?PATTERN? applies. If "'" is the delimiter, no interpolation is performed on the PATTERN. When using a character valid in an identifier, whitespace is required after the m.
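As a rough illustration of the perlop paragraph, the sketch below matches the same path with several delimiters. The path and variable names are made up for the example; only the delimiter rules come from the quoted documentation.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical sample path

# Default "/" delimiter: each "/" in the pattern needs a backslash (LTS).
my $slash  = ($path =~ /^\/usr\/local\/bin\//) ? 1 : 0;

# Other delimiters require the leading m and avoid the escaping.
my $braces = ($path =~ m{^/usr/local/bin/}) ? 1 : 0;
my $hash   = ($path =~ m#^/usr/local/bin/#) ? 1 : 0;

# "x" is valid in an identifier, so whitespace must follow the m.
my $letter = ($path =~ m x^/usr/local/bin/x) ? 1 : 0;

print "slash=$slash braces=$braces hash=$hash letter=$letter\n";
```

All four forms compile to the same pattern; the choice of delimiter only changes which characters need escaping inside it.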

answered Apr 24 '11 at 13:01

Mat

Thanks, I was looking in perlre. – Tim Apr 24 '11 at 13:06
That's because they aren't regex delimiters, they are operator delimiters. The regexes are what happens inside the delimiters. – Dave Cross Apr 24 '11 at 13:14
Theory and practice conflict a bit here. – tchrist Apr 24 '11 at 13:20
There are four special characters that you use in pairs: () [] {} <>. Example: perl -nlE'if(m){say"FOO"}' – shawnhcorey Apr 24 '11 at 13:27
@shawnhcorey Is that non-regex specific but rather perl specific? I couldn't understand from documentation. Thanks. – Dejan Marjanović Apr 11 '14 at 19:20
@TOOTSKI, I don't know. I use Perl almost exclusively. – shawnhcorey Apr 11 '14 at 23:14
Note that you use an `s`, not an `m`, when doing a replace (aka substitute) with regular expressions. http://www.perlfect.com/articles/regex.shtml – Mashmagar Nov 24 '14 at 16:22
The line from perlop actually says "With the m you can use any pair of non-whitespace (ASCII) characters as delimiters." (as of v5.36). There are many non-ASCII characters that you can use, but there are some exceptions for some paired characters. – brian d foy Jan 02 '23 at 14:54

score 7 · Answer 2 · answered Apr 24 '11 at 14:08

As is often the case, I wonder "can I write a Perl program to answer that question?".

Here is a pretty good first approximation that tries all of the printable ASCII characters:

#!/usr/bin/perl
use warnings;
use strict;

$_ = 'foo bar'; # something to match against

foreach my $ascii (32 .. 126) {
    my $delim = chr $ascii;
    next if $delim eq '?'; # avoid fatal error

    foreach my $m ('m', 'm ') {  # with and without space after "m"
        my $code = $m . $delim . '(\w+)' . $delim . ';';
#        print "$code\n";
        my $match;
        {
            no warnings 'syntax';
            ($match) = eval $code;
        }
        print "[$delim] didn't compile with $m$delim$delim\n" if $@;
        if (defined $match and $match ne 'foo') {
            print "[$delim] didn't match correctly ($match)\n";
        }
    }
}
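The script above reuses the opening character as the closing delimiter, so the four bracket characters are never tested with their proper partners. One possible refinement is sketched below; `closing_delim` and `%closer` are hypothetical names introduced here, not part of the original script, and only a handful of delimiters are tried.

```perl
use strict;
use warnings;

# Hypothetical helper: map an opening bracket to its closing partner.
# Any other character is assumed to close itself.
my %closer = ('(' => ')', '[' => ']', '{' => '}', '<' => '>');

sub closing_delim {
    my ($open) = @_;
    return $closer{$open} // $open;
}

$_ = 'foo bar';   # something to match against

my %result;
for my $open ('(', '[', '{', '<', '/', '!') {
    # Build e.g. m{(\w+)}; and evaluate it against $_
    my $code = 'm' . $open . '(\w+)' . closing_delim($open) . ';';
    my ($match) = eval $code;
    $result{$open} = $@ ? 'error' : ($match // 'no match');
    print "$code => $result{$open}\n";
}
```

Bracket delimiters nest, which is why `m((\w+));` still parses with the capture group inside the parentheses.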

answered Apr 24 '11 at 14:08

tadmc

Neat solution, but it's going to take a while to go through all Unicode characters. – Tim Apr 24 '11 at 14:15
Don't worry @Tim Nordenfur, I'm sure he doesn't have to pay the computer overtime :) – ikegami Apr 25 '11 at 05:19
Heh, I went to look through everything from 0 to 0x10ffff. About 75% of the characters don't compile as delimiters even with spaces. Most of that is above 0xFFFF though, so that weights that number considerably. – brian d foy Dec 22 '22 at 12:24

score 6 · Answer 3 · edited Jul 28 '16 at 02:24

There is currently a bug in the lexer that sometimes prevents UTF-8 characters from being used as a delimiter, even though you can sneak Latin1 by it if you aren't in full Unicode mode.

edited Jul 28 '16 at 02:24

Laurel

answered Apr 24 '11 at 13:20

tchrist

Any specific chars? `use utf8; ...; $str =~ m ê ê;` works as expected here within a UTF-8 encoded script. – Mat Apr 24 '11 at 13:28
@Mat, note that that's a latin-1 character. – Tim Apr 24 '11 at 13:52
`$str =~ m ش ش` parses (and works) too, and that's not latin1 (arabic iso-8859-6). – Mat Apr 24 '11 at 13:54
@Mat, I don’t have time to show you the example bugs right now, but they do exist. I bumped into them again a couple of days ago, but I have to run right now. – tchrist Apr 24 '11 at 15:13

score 5 · Answer 4 · answered Apr 24 '11 at 13:01

Just about any non-whitespace character can be used, though identifier characters have to be separated from the initial m by whitespace. Note that using a single quote as the delimiter disables interpolation and most backslash escaping.
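A minimal sketch of the single-quote behaviour, using made-up variable names and sample text: the same pattern text is matched once with "/" and once with "'" as the delimiter.

```perl
use strict;
use warnings;

my $name = 'foo';        # hypothetical variable to interpolate
my $text = 'foo bar';

# Ordinary delimiter: $name is interpolated, so the pattern is ^foo.
my $with_slash = ($text =~ m/^$name/) ? 1 : 0;

# Single-quote delimiter: no interpolation, so the regex engine sees the
# literal pattern ^$name, where "$" is an end-of-string anchor followed by
# the letters "name". That cannot match this text.
my $with_quote = ($text =~ m'^$name') ? 1 : 0;

print "m/^\$name/ matched: $with_slash\n";
print "m'^\$name' matched: $with_quote\n";
```

The single-quote form is mainly useful when the pattern contains `$` or `@` that should not be treated as variables.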

answered Apr 24 '11 at 13:01

ysth
