
I want to use a switch/case construct in Perl. I have a file that contains a sequence of words on each line, and I want to treat each line differently according to the number of words it contains.

An example file:

w1 w2 w2
w1 w3

So the script will look something like this, but how do I calculate the number of words in each line?

given ($number_of_word_in_line) {
   when ($_ > 3) {
       ...
   }
   when ($_ > 2) {
       ...
   }
   default {
       ...
   }
}
Grant McLean
nicha
  • Read the lines in a loop, make each word a hash key, and store its count (the number of times the word appeared) as the value. – AbhiNickz May 10 '17 at 21:32

2 Answers


Please be careful with the switch statement, which is highly experimental. From the Perl documentation:

As previously mentioned, the "switch" feature is considered highly experimental; it is subject to change with little notice. In particular, when has tricky behaviours that are expected to change to become less tricky in the future. Do not rely upon its current (mis)implementation. Before Perl 5.18, given also had tricky behaviours that you should still beware of if your code must run on older versions of Perl.

These are tricky and will change.
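One way to sidestep the experimental feature entirely is a plain if/elsif chain. The following is only a sketch: the helper name classify_count and the labels it returns are hypothetical, and the thresholds are placeholders rather than anything taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: map a word count to a label without given/when.
# Conditions are tested top to bottom, so the stricter test comes first.
sub classify_count {
    my ($n) = @_;
    if    ($n > 3) { return 'more than three' }
    elsif ($n > 2) { return 'exactly three' }
    else           { return 'two or fewer' }
}

print classify_count($_), "\n" for 2, 3, 4;
```

The order of the tests matters here just as it does with when: a count above 3 is also above 2, so the broader condition has to come second.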

Having said that, one way to count the words in a string is to split it first:

use warnings;
use strict;
use feature 'switch';

my $file = '...';
open my $fh, '<', $file  or die "Can't open $file: $!";

while (my $line = <$fh>)
{
    chomp $line;
    my @words = split ' ', $line;
    my $num_words = @words;
    
    given ($num_words) {
        when ($num_words > 2) { 
            # ...
        }
    }
}
close $fh;

This uses the fact that a scalar ($num_words), when assigned an array (@words), receives the number of elements of the array. See Context in perldata:

Assignment is a little bit special in that it uses its left argument to determine the context for the right argument. Assignment to a scalar evaluates the right-hand side in scalar context, [...]

and an array evaluated in scalar context yields the number of its elements.
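A small sketch of the difference between the two contexts may help; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my @words = ('w1', 'w2', 'w3');

my $count   = @words;          # scalar assignment: number of elements
my ($first) = @words;          # parentheses make it list assignment: first element
my $also    = scalar(@words);  # scalar() forces scalar context explicitly

print "$count $first $also\n"; # prints "3 w1 3"
```

Note how the parentheses around $first change the assignment from scalar to list context, which is an easy mistake to make when the count is what was wanted.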

Here we can skip the array altogether

my $num_words = split ' ', $line;

So, in order to get the count without creating an array variable, we need to assign directly to a scalar. That isn't always going to yield the length of the list, however: putting the right-hand side in scalar context (by assigning it to a scalar) may affect how it operates and what it returns.

There are workarounds though. For example

my $num_words = () = $line =~ /\w+/g;

where the "operator" = () = is a play on context, or

my $num_words = @{ [ $line =~ /\w+/g ] };

where the [ ] builds an anonymous array from the list inside (so the match runs in list context) and @{ } dereferences it. The result is an array, so assigning it to a scalar yields the number of its elements.§
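As a quick check, the idioms discussed so far can be run side by side on one sample line. This is only a sketch with made-up variable names.

```perl
use strict;
use warnings;

my $line = 'w1 w2 w3';

my @words     = $line =~ /\w+/g;          # list context: all the matches
my $via_array = @words;                   # array in scalar context
my $via_empty = () = $line =~ /\w+/g;     # the "= () =" idiom
my $via_deref = @{ [ $line =~ /\w+/g ] }; # anonymous array, dereferenced
my $via_split = split ' ', $line;         # split in scalar context

print "$via_array $via_empty $via_deref $via_split\n"; # prints "3 3 3 3"
```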

See this page for a wealth of information about lists, arrays, scalars, and context.


This can be done more compactly as

while (<$fh>) {
    chomp;
    my $num_words = split;
    # ...
}

In while (<$fh>) each line is assigned to $_, and both chomp and split operate on $_ by default. The split also needs a pattern and the default is ' ', so the above is the same as split ' ', $_. The pattern ' ' is special for split: it splits on any run of whitespace and also skips leading whitespace (trailing empty fields are dropped by default).

Note that once we assign to a variable inside the while condition (like $line in the main text), the deal with $_ is off: the loop no longer sets it, so it is typically undef. So it is either our variable or $_. A reasonable rule of thumb is that if you end up using $_ more than once or twice then there should be a proper variable. And if ever in doubt, introduce a nice variable.

The regex match operator returns the actual matches in list context but only true/false in scalar context. (In scalar context, /g finds just one match per evaluation, so it does not give a count directly.)
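The following sketch contrasts the contexts. In scalar context a match with /g picks up where the previous one stopped, so counting that way needs a loop. Variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = 'w1 w2 w3';

my @matches = $line =~ /\w+/g;   # list context: ('w1', 'w2', 'w3')
my $matched = $line =~ /\w+/;    # scalar context: true (1) if there is a match

my $count = 0;
$count++ while $line =~ /\w+/g;  # scalar context with /g: one match per iteration

print scalar(@matches), " $matched $count\n"; # prints "3 1 3"
```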

§ Another example is split, which returns the size of the list in scalar context.

zdim

Counting the number of words on a line is a problem with many possible solutions. Here's a very simple one:

sub count_words {
    my($line) = @_;

    my @words = split ' ', $line;
    return scalar(@words);
}

use feature 'say';  # say is not enabled by default

my $line = " The  quick brown fox jumps over the  lazy dog \n";

say "count_words(): " . count_words($line);  # prints 'count_words(): 9'

Normally Perl's split function treats the first argument as a regex, but if the argument is a string containing exactly one space then leading whitespace is discarded, and the regex /\s+/ is used. This allows skipping over multiple consecutive whitespace characters and also causes trailing whitespace to be discarded.
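To see what the special case buys you, compare it with an explicit /\s+/ on a line that starts with whitespace. A minimal sketch with made-up variable names:

```perl
use strict;
use warnings;

my $line = " The  quick brown fox \n";

my @special = split ' ',   $line;  # leading whitespace skipped
my @regex   = split /\s+/, $line;  # leading whitespace yields an empty first field

print scalar(@special), " ", scalar(@regex), "\n"; # prints "4 5"
```

In both cases the trailing whitespace produces no extra element, because split drops trailing empty fields by default.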

You didn't mention what type of 'words' you want to count. Is it written language? Will there be punctuation? Is it ASCII text? Depending on the answers to these questions, you might get better results using a regex to "capture" words:

sub count_words {
    my($line) = @_;

    my @words = $line =~ /(\w+)/g;
    return scalar(@words);
}

This will cope with missing spaces around punctuation (e.g.: "one,two,three" will be seen as three words whereas split would see it as one). But it won't work with apostrophes (e.g.: "won't" will be seen as two words) and it won't work with non-ASCII characters (e.g.: "réfrigérateur" will be seen as three words).

To include an apostrophe in the list of characters that make up a word, you could change the regex line to:

    my @words = $line =~ /([\w']+)/g;

However, if your text has had the ASCII apostrophes changed to "smart quote" characters then you might need something like:

    my @words = $line =~ /([\w'\x{2019}]+)/g;
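A short sketch comparing these patterns on a sample string follows. The helper count_matches is hypothetical and exists only to keep the comparison compact.

```perl
use strict;
use warnings;

# Hypothetical helper: count the matches of a precompiled pattern in a string.
sub count_matches {
    my ($text, $re) = @_;
    my @found = $text =~ /$re/g;
    return scalar(@found);
}

my $plain = "won't stop,now";
my $smart = "won\x{2019}t";   # right single quotation mark as the apostrophe

print count_matches($plain, qr/\w+/),             "\n"; # 4: won, t, stop, now
print count_matches($plain, qr/[\w']+/),          "\n"; # 3: won't, stop, now
print count_matches($smart, qr/[\w']+/),          "\n"; # 2: won, t
print count_matches($smart, qr/[\w'\x{2019}]+/),  "\n"; # 1
```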

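To allow the \w part of the regex to match accented characters in string literals in your source code, you can add this at the top of your script (text read from a file needs to be decoded as well, for example by opening the file with the :encoding(UTF-8) layer):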

use utf8;

That seems to work regardless of whether a character like é is represented as the single codepoint U+00E9 or as two codepoints with a plain letter and a combining character accent: U+0065 U+0301.
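The sketch below illustrates the underlying distinction between raw bytes and decoded characters. It assumes the input is UTF-8 and uses the core Encode module to decode it; the byte string stands in for a line read from a file without any decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

sub count_words {
    my ($line) = @_;
    my @words = $line =~ /(\w+)/g;
    return scalar(@words);
}

# "réfrigérateur" as raw UTF-8 bytes, i.e. not yet decoded
my $bytes      = "r\xC3\xA9frig\xC3\xA9rateur";
# The same text decoded into characters (é is the single codepoint U+00E9)
my $chars      = decode('UTF-8', $bytes);
# Decomposed form: plain e followed by the combining acute accent U+0301
my $decomposed = "re\x{301}frige\x{301}rateur";

print join(' ', count_words($bytes), count_words($chars), count_words($decomposed)), "\n";
# prints "3 1 1"
```

Only the undecoded bytes are split into three pieces; once Perl sees characters rather than bytes, both the precomposed and the decomposed spellings count as one word.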

Another user's comment on your question suggested they thought you might be wanting to count unique words on a line (e.g.: " one plus one" would be seen as two unique words). If so, you'll need to use a hash to reduce @words to a unique list.
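A minimal sketch of that idea, assuming whitespace-separated words and case-sensitive comparison (the function name count_unique_words is made up for illustration):

```perl
use strict;
use warnings;

# Count the distinct words on a line by using them as hash keys.
sub count_unique_words {
    my ($line) = @_;
    my %seen;
    $seen{$_}++ for split ' ', $line;
    return scalar(keys %seen);
}

print count_unique_words(" one plus one"), "\n"; # prints "2"
```

As a side effect, %seen holds the number of times each word appeared, which is the per-word count the comment on the question describes.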

Grant McLean