I have a lot of trouble understanding the basic rules of regex and hope that someone could help explain them in "plain English".

use strict;
use warnings;

# Sample text; the match below tests $_ implicitly.
$_ = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

print "Enter a regular expression: ";
my $pattern = <STDIN>;
chomp($pattern);    # drop the trailing newline from the input

if (/$pattern/) {
    print "The text matches the pattern '$pattern'.\n";
    # Report only the capture groups that took part in the match.
    print "\$1 is '$1'\n" if defined $1;
    print "\$2 is '$2'\n" if defined $2;
    print "\$3 is '$3'\n" if defined $3;
    print "\$4 is '$4'\n" if defined $4;
    print "\$5 is '$5'\n" if defined $5;
}

Three test outputs

Enter a regular expression: ([a-z]+)
The text matches the pattern '([a-z]+)'.
$1 is 'silly'

Enter a regular expression: (\w+)
The text matches the pattern '(\w+)'.
$1 is '1'

Enter a regular expression: ([a-z]+)(.*)([a-z]+)
The text matches the pattern '([a-z]+)(.*)([a-z]+)'.
$1 is 'silly'
$2 is ' sentence (495,a) *BUT* one which will be usefu'
$3 is 'l'

My confusion is as follows:

  1. Doesn't ([a-z]+) mean "a lowercase letter, repeated one or more times"? If so, shouldn't "will" be picked up as well? Unless it has something to do with () being about memory (i.e. "silly" is a 5-letter word, so "will" will not be picked up, but "willx" will?)

  2. Doesn't (\w+) mean "any word, repeated one or more times"? If so, why is the number "1" picked up, since it is not repeated and is followed by a colon ":"?

  3. Does ([a-z]+)(.*)([a-z]+) mean "any lowercase letter, repeated one or more times", immediately followed by "anything, repeated zero or more times", immediately followed by "any lowercase letter, repeated one or more times"? If so, why does the output look like the one shown above?

I have looked this up online as much as I could but still fail to understand these rules. Any help will be greatly appreciated. Thank you.

B Chen

2 Answers

  1. No, it means "one or more unaccented lowercase Latin letters".

    Yes, "will" would also match, but the match op only returns the first match unless you use /g.

    print "$1\n" while /([a-z]+)/g;  # //g in scalar context
       or
    print "$_\n" for /([a-z]+)/g;    # //g in list context
    

    See m/PATTERN/ in perlop for details on how to use /g.

  2. No, it means "one or more word chars", so it can indeed match a single character.

    Or maybe you're surprised that 1 is a word char? In the ASCII range, the word chars are A-Z, a-z, 0-9 and _. Another 102,661 word chars are found outside of the ASCII range.

  3. It means "one or more unaccented lowercase Latin letters, followed by any number of characters other than newline, followed by one or more unaccented lowercase Latin letters".

    If you're asking why .* is matching so much: at the current location, the engine first tries to match as much as possible, and gives characters back only when the rest of the pattern would otherwise fail to match. This is called greediness.

    Maybe you're looking for /([a-z]+)([^a-z]+)([a-z]+)/.
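
The three points above can be checked with a short script. This is only a sketch that reuses the sentence from the question; the variable names ($text, $first, @words, $word, @parts) are made up for illustration.

```perl
use strict;
use warnings;

my $text = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

# 1. Without /g, only the first (leftmost) match is returned.
my ($first) = $text =~ /([a-z]+)/;

#    With /g in list context, every non-overlapping match is returned.
my @words = $text =~ /([a-z]+)/g;

# 2. \w+ needs only one word character, so the leading "1" is enough.
my ($word) = $text =~ /(\w+)/;

# 3. With [^a-z]+ in the middle, the second group cannot run past the
#    next lowercase letter, so the third group gets a whole word.
my @parts = $text =~ /([a-z]+)([^a-z]+)([a-z]+)/;

print "first: '$first'\n";
print "words: @words\n";
print "word:  '$word'\n";
print "parts: ", join(', ', map { "'$_'" } @parts), "\n";
```

Here @words should include "will", and @parts should hold "silly", a single space, and "sentence".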

ikegami
0
  1. I'm really not sure why you would expect that. The engine scans your sentence for the first lowercase letter and keeps matching lowercase letters until it reaches a character that isn't one (in your case, a space). The match is 'silly', as it should be, and matching stops at that point.

  2. \w matches a "word character", which includes digits and "_" but no other punctuation. ":" is not a word character, so you get "1" and nothing else.

  3. This is because (.*) is "greedy" (and generally you shouldn't use it). You're telling Perl to match anything and everything to the end of the line. It then backtracks just far enough for the final ([a-z]+) to match, which leaves that group only the last lowercase letter in your string, the "l" of "useful".

EDIT: as @ikegami pointed out, \w actually matches quite a bit more than what I was thinking.
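
To see the backtracking described in point 3, it may help to compare the greedy pattern with a non-greedy one. The sketch below uses the non-greedy quantifier .*? as one possible alternative; it is not something the question asked for, and the names @greedy and @lazy are made up for illustration.

```perl
use strict;
use warnings;

my $text = '1: A silly sentence (495,a) *BUT* one which will be useful. (3)';

# Greedy: .* first runs to the end of the string, then gives characters
# back one at a time until the final [a-z]+ can match.
my @greedy = $text =~ /([a-z]+)(.*)([a-z]+)/;

# Non-greedy: .*? starts by matching nothing and grows only as far as
# needed for the final [a-z]+ to match.
my @lazy = $text =~ /([a-z]+)(.*?)([a-z]+)/;

print "greedy: ", join(', ', map { "'$_'" } @greedy), "\n";
print "lazy:   ", join(', ', map { "'$_'" } @lazy),   "\n";
```

The greedy version should reproduce the output from the question, while the non-greedy version should capture "silly", a single space, and "sentence".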

Cfreak