There are already several good discussions of regular expressions and empty lines on SO. I'll remove this question if it is a duplicate.

Can anyone explain why this script outputs 5 3 4 5 4 3 instead of 4 3 4 4 4 3? When I run it in the debugger, $blank and $classyblank stay at "4" (which I assume is the correct value) until just before the print statement.

my ( $blank, $nonblank, $non_nonblank, 
     $classyblank,  $classyspace, $blanketyblank ) = (0) x 6 ;

while (<DATA>) {

  $blank++ if /\p{IsBlank}/         ; # POSIXly blank - 4?
  $nonblank++ if /^\P{IsBlank}$/    ; # POSIXly non-blank - 3
  $non_nonblank++ if not /\S/       ; # perlishly not non-blank - 4
  $classyblank++ if /[[:blank:]]/   ; # older(?) charclass blankness - 4?
  $classyspace++ if /^[[:space:]]$/ ; # older(?) charclass whitespace - 4
  $blanketyblank++ if /^$/          ; # perlishly *really empty*  - 3

}

print join " ", $blank, $nonblank, $non_nonblank,
            $classyblank, $classyspace, $blanketyblank , "\n" ;

__DATA__

line above only has a linefeed this one is not blank because: words

this line is followed by a line with white space (you may need to add it)

then another blank line following this one

THE END :-\

Is it something to do with the __DATA__ section or am I misunderstanding POSIX regular expressions?


ps:

As noted in comment on a timely post elsewhere, "really empty" (/^$/) can miss non-emptiness:

perl -E 'my $string = "\n" . "foo\n\n" ; say "empty" if $string =~ /^$/ ;'
perl -E 'my $string = "\n" . "bar\n\n" ; say "empty" if $string =~ /\A\z/ ;'
perl -E 'my $string = "\n" . "baz\n\n" ; say "empty" if $string =~ /\S/ ;' 
G. Cito
  • Then there's `if /\A\Z/ ` and `if /\A\z/ ` ... which are pretty consistent across different languages [except python but that's OK](http://stackoverflow.com/questions/7063420/perl-compatible-regular-expression-pcre-in-python). – G. Cito Mar 22 '16 at 18:56
  • `This is perl 5, version 22, subversion 0 (v5.22.0) built for amd64-freebsd` – G. Cito Mar 22 '16 at 19:07
  • Not related to your core question, but `my $string = "\n", "foo\n\n"` assigns a single newline to `$string`. The rest is thrown away because of the comma operator. – ThisSuitIsBlackNot Mar 22 '16 at 19:31
  • I addressed this at length in [my answer to your recent comment](http://stackoverflow.com/questions/36128040/extract-and-filter-a-range-of-lines-from-the-input-using-perl/36129515?noredirect=1#comment59963005_36129515). I won't write it up again to match your newly-phrased question. The only delinquent in the patterns you have used is `$`, which will match the end of a string or before the newline if it is the last character. `\p{IsBlank}`, `[[:blank:]]` are simple character classes and you can check what they do from [perldoc perluniprops](http://perldoc.perl.org/perluniprops.html) – Borodin Mar 22 '16 at 19:41
  • @Borodin - thanks. I'm trying to get straightened out about the character classes from the charts in [`perlrecharclass`](http://perldoc.perl.org/perlrecharclass.html) by lining them up with their well-known perl equivalents (such as `/\S/`) and/or related "idioms". I was getting results I couldn't explain: specifically how `\v`, `\s` and `\h` interact with `\n` and `" "`. I think I have it figured out now and will add a separate answer if one doesn't appear. – G. Cito Mar 23 '16 at 01:12
  • @thissuitisnotblack typo ... I think. Still not sure if correction illustrates the need to approach non-emptiness carefully à la borodin – G. Cito Jul 13 '17 at 17:33
  • *i.e.* carefully à la @borodin – G. Cito Jul 13 '17 at 17:33
  • @G.Cito: I don't understand what issue you have. Your main mistake is to expect any character class to identify whether a line is "empty" or "blank". I suggest you open a new question. – Borodin Jul 13 '17 at 17:43

1 Answer

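/\p{IsBlank}/ doesn't check for an empty string. \p{...} matches a single character that has the specified Unicode property, so the unanchored pattern succeeds whenever such a character appears anywhere in the line.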

$ unichars '\p{IsBlank}' | cat
 ---- U+0009 CHARACTER TABULATION
 ---- U+0020 SPACE
 ---- U+00A0 NO-BREAK SPACE
 ---- U+1680 OGHAM SPACE MARK
 ---- U+2000 EN QUAD
 ---- U+2001 EM QUAD
 ---- U+2002 EN SPACE
 ---- U+2003 EM SPACE
 ---- U+2004 THREE-PER-EM SPACE
 ---- U+2005 FOUR-PER-EM SPACE
 ---- U+2006 SIX-PER-EM SPACE
 ---- U+2007 FIGURE SPACE
 ---- U+2008 PUNCTUATION SPACE
 ---- U+2009 THIN SPACE
 ---- U+200A HAIR SPACE
 ---- U+202F NARROW NO-BREAK SPACE
 ---- U+205F MEDIUM MATHEMATICAL SPACE
 ---- U+3000 IDEOGRAPHIC SPACE

It matches " \n" since SPACE has the IsBlank property.
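The difference between "contains a blank character" and "consists only of blank characters" can be sketched as follows. This is a minimal illustration rather than the question's script: the sample lines and the variable names `$has_blank` and `$only_blank` are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen for illustration only.
my @lines = ( "\n", " \n", "two words\n", "word\n", "a b c\n" );

# Unanchored: true if ANY character in the line has the IsBlank property.
my $has_blank = grep { /\p{IsBlank}/ } @lines;

# Anchored: true only if the WHOLE line is zero or more blanks,
# optionally followed by a single trailing newline.
my $only_blank = grep { /\A\p{IsBlank}*\n?\z/ } @lines;

print "$has_blank $only_blank\n";    # prints "3 2"
```

The unanchored pattern counts every line that has a space between words, which is why it overcounts when the goal is to find blank lines.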


/[[:blank:]]/ doesn't check for an empty string. [...] matches a single character that is a member of the specified class.

$ unichars '[[:blank:]]' | cat
 ---- U+0009 CHARACTER TABULATION
 ---- U+0020 SPACE
 ---- U+00A0 NO-BREAK SPACE
 ---- U+1680 OGHAM SPACE MARK
 ---- U+2000 EN QUAD
 ---- U+2001 EM QUAD
 ---- U+2002 EN SPACE
 ---- U+2003 EM SPACE
 ---- U+2004 THREE-PER-EM SPACE
 ---- U+2005 FOUR-PER-EM SPACE
 ---- U+2006 SIX-PER-EM SPACE
 ---- U+2007 FIGURE SPACE
 ---- U+2008 PUNCTUATION SPACE
 ---- U+2009 THIN SPACE
 ---- U+200A HAIR SPACE
 ---- U+202F NARROW NO-BREAK SPACE
 ---- U+205F MEDIUM MATHEMATICAL SPACE
 ---- U+3000 IDEOGRAPHIC SPACE

It matches " \n" since SPACE is a member of the [:blank:] POSIX character class and thus a member of the [[:blank:]] character class.
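To tie this back to the counts in the question, here is a rough sketch that applies the same six patterns to an in-memory list of lines. It assumes the DATA section amounts to eight lines, with the fifth holding a single space; the wording of the text lines is a placeholder, and `@lines` and `@counts` are names invented for the example.

```perl
use strict;
use warnings;

# Assumed stand-in for the question's __DATA__ section: eight lines,
# the fifth holding a single space. The words are placeholders.
my @lines = (
    "\n",
    "some words here\n",
    "\n",
    "more words here\n",
    " \n",
    "still more words\n",
    "\n",
    "THE END\n",
);

my @counts = (
    scalar( grep { /\p{IsBlank}/ }    @lines ),  # any blank character anywhere
    scalar( grep { /^\P{IsBlank}$/ }  @lines ),  # exactly one non-blank character
    scalar( grep { !/\S/ }            @lines ),  # no non-whitespace character
    scalar( grep { /[[:blank:]]/ }    @lines ),  # any blank character anywhere
    scalar( grep { /^[[:space:]]$/ }  @lines ),  # one whitespace char, optional newline
    scalar( grep { /^$/ }             @lines ),  # nothing before the final newline
);

print "@counts\n";    # prints "5 3 4 5 4 3" for the lines above
```

The first and fourth counts are 5 because the four text lines contain spaces between words and the fifth line contains a lone space. None of those six patterns isolates the single whitespace-only line.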

ikegami
  • Thanks, beginning to grok ... since `$nonblank++ if /\P{IsBlank}/` (without the anchors) gives me "8" (`__DATA__` has 8 lines) I assume it is counting `\n` as non-`{IsBlank}` (due to `\P`) and thus is seeing 8 matches. Then, as `/^\P{IsBlank}$/`, incrementing is based on the three lines of single non-blank horizontal characters (`\n`) so I get "3". However `/\p{IsBlank}/` gives me a count of "5" because there are five rows with `\s` style horizontal "blank characters": the four with text (and whitespace between words), and line number 5 which consists of `" "\n` appearing as an empty row. – G. Cito Mar 23 '16 at 01:21