3

I recently created a Perl script that searches for words that begin with D and and E with this code:

$infile = 'words.txt';
open(IN, $infile);
$count = 0;
while ($word = <IN>) {
chomp($word);
if ($word =~ /^d\w*e$/i) {
    print "$word\n";
    $count++;
  }
}
print "$count\n";

I recently decided to fork the code and create a script that searches for a word that is six letters and letters in the word are in alphabetic (A to Z) order. Instead of using words.txt, I plan to use the Unix standard dictionary located at usr/share/dict/words. How can I accomplish this by modifying this code?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
sharksfan98
  • 547
  • 3
  • 9
  • 19

4 Answers4

9

It looks like what you really need is an algorithm for checking whether the letters in a given word are in alphabetic sequence. There are several ways, but this subroutine works by splitting the word into a list of its constituent characters, sorting that list and recombining it. If the result matches the original word then that word was already sorted.

use strict;
use warnings;

use feature 'fc';

for (qw/ a ab ba cab alt effort toffee /) {
  print "$_\n" if in_alpha_order($_);
}

sub in_alpha_order {
  my $word = fc(shift);
  my $new = join '', sort $word =~ /./g;
  return $new eq $word;
}

output

a
ab
alt
effort

If you really wanted to do this in a regular expression, you could build an alternation like

a(?=[a-z]) | b(?=[b-z]) | c(?=[c-z]) ...

Here is a program that works that way. Its output is identical to that of the one above.

use strict;
use warnings;

my $regex = join '|', map "${_}(?=[$_-z])", 'a'..'z';
$regex = qr/^(?:$regex)*.$/i;

for (qw/ a ab ba cab alt effort toffee /) {
  print "$_\n" if $_ =~ $regex;
}
Borodin
  • 126,100
  • 9
  • 70
  • 144
  • note: a single Unicode codepoint can span multiple bytes and a single user-perceived character can span multiple Unicode codepoints. To get correct number of character in a word that might contain non-ascii characters, you could use `/\X/` and to sort alphabetically you could use `Unicode::Collate`. See [point 32 in @tchrist's answer](http://stackoverflow.com/a/6163129/4279). – jfs Jan 08 '13 at 14:07
2

To support non-ASCII words:

#!/usr/bin/perl
use strict;
use warnings;
use encoding 'utf8'; # Upgrade byte strings using UTF-8
use Unicode::Collate; # To sort letters alphabetically

use constant NCHARS => 6; # Consider only words with NCHARS characters in them
my $filename = '/usr/share/dict/words';
open (my $fh, '<:encoding(UTF-8)', $filename)
    or die "can't open '$filename' $!";

my $collator = Unicode::Collate::->new();
while (my $word = <$fh>) {
    chomp $word;
    my @chars = ($word =~ /\X/g); # Split word into characters
    # Print words with given length that have characters in alphabetical order
    print "$word\n" if (@chars == NCHARS &&
                        join('', $collator->sort(@chars)) eq $word);
}
close $fh;
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
jfs
  • 399,953
  • 195
  • 994
  • 1,670
1

Here's one option:

#!/usr/bin/env perl

use warnings;
use strict;

my $wordsFile = '/usr/share/dict/words';
my $count     = 0;

open my $fh, '<', $wordsFile or die $!;

while ( my $word = <$fh> ) {
    chomp $word;
    next unless length $word == 6;

    my $sorted = join '', sort split //, lc $word;

    if ( $sorted eq lc $word ) {
        print "$word\n";
        $count++;
    }
}

close $fh;

print "$count\n";

This splits the original word to alphabetize the letters. The letters are rejoined to form a new word. A comparison is then made against the original word. If they're the same, it's printed and counted.

Kenosis
  • 6,196
  • 1
  • 16
  • 16
1

I have a solution similar to Kenosis' and Borodin's, however you need to be careful about case. Perl's default sort function puts all capital letters before lower case. My version below takes care of this.

#!/usr/bin/env perl

use strict;
use warnings;

sub is_six_letter_word {
    my $word = shift;
    return length($word) == 6;
}

sub is_letters_in_alphabetical_order {
    my $word = shift;
    $word = fc($word);

    my @chars = split("", $word);
    my $sorted_word = join("", sort(@chars));

    return $word eq $sorted_word;
}

open(my $fh_in, $ARGV[0]) or die "Error opening input file";

my $word = undef;
while ($word = <$fh_in>) {
    chomp($word);
    if (is_six_letter_word($word) && is_letters_in_alphabetical_order($word)) {
        printf("%s\n", $word);
    }
}

close($fh_in);
ajmccluskey
  • 505
  • 4
  • 10
  • The two `return`s are an eyesore. Why not just `return length($word) == 6` ? – Zaid Jan 07 '13 at 15:38
  • Thanks for the feedback. I think I got into the habit of explicitly returning 1 or 0 for true/false so I knew what I was getting when I started with Perl. Thinking about it again I don't think it was ever necessary. – ajmccluskey Jan 07 '13 at 20:44