2

I have a large word-list file with one word per line. I would like to filter out the words that contain repeated letters.

INPUT:
  abducts
  abe
  abeam
  abel
  abele

OUTPUT:
  abducts
  abe
  abel

I'd like to do this using Regex (grep or perl or python). Is that possible?

Vivin Paliath
Amjith
  • Why is `abel` in the output? I would have thought that `abe` would trump it. – chrisaycock Apr 12 '11 at 15:46
  • @chrisaycock: I think the OP means reject any words which (individually) contain more than one instance of any letter – Colin Fine Apr 12 '11 at 15:48
  • Why do you want to use a regex? While it might be doable, it's definitely going to be worse performance and much less readable than just counting characters. – Mark Tozzi Apr 12 '11 at 15:50
  • It's not doable. You don't have any way to store information. You need something like a DFA, probabilistic automaton or something with a stack – joe_coolish Apr 12 '11 at 15:54
  • If you want to scare people, do something like this `cat words | perl -ne '@c=split(//);$f=1;%m=();foreach $w(@c){if($m{$w}){$f=0;}else{$m{$w}=1;}}print if$f;'` – Vivin Paliath Apr 12 '11 at 15:55
  • @joe_coolish: Please stop spreading FUD: it is trivially possible with `!/(\pL).*?\1/si`. When theory meets practice, theory loses. – tchrist Apr 12 '11 at 17:43
  • @Mark: Unless you’re writing in low-level C, which I do not believe is an option, counting characters is going to take more time in a high-level language than a callout to that language’s regex library, which is already in tight C code, is likely to take. This is not always what we would expect, but it is indeed so. – tchrist Apr 12 '11 at 19:10
  • @tchrist: I didn't believe you, but I just tested it myself and in perl at least the accepted answer runs more than twice as quickly as a character based solution I banged up. Thanks for calling my attention to it. – Mark Tozzi Apr 13 '11 at 15:16

10 Answers

7

It's much easier to write a regex that matches words that do have repeating letters, and then negate the match:

my @input = qw(abducts abe abeam abel abele);
my @output = grep { not /(\w).*\1/ } @input;

(This code assumes that @input contains one word per entry.) But this problem isn't necessarily best solved with a regex.

I've given the code in Perl, but it could easily be translated into any regex flavor that supports backreferences, including grep (which also has the -v switch to negate the match).
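
For instance, a minimal end-to-end sketch in Perl (assuming the word list lives in a file called wordlist.txt, one word per line) could look like this:

#!/usr/bin/perl
use strict;
use warnings;

# 'wordlist.txt' is just an assumed file name; print only the words
# with no repeated letter.
open my $fh, '<', 'wordlist.txt' or die "can't open wordlist.txt: $!\n";
while (my $word = <$fh>) {
    chomp $word;
    print "$word\n" unless $word =~ /(\w).*\1/;
}
close $fh;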

cjm
  • Nice. I wonder how that compares speed-wise to a split and count dups method like this: `my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;` It seems like you could have a lot of backtracking... – daotoad Apr 12 '11 at 16:05
  • @daotoad, the regex runtime is quadratic on the length of the word. The `split` approach is linear. But words are short enough that I'm not sure which approach would actually be faster in practice. You'd have to benchmark it. – cjm Apr 12 '11 at 16:12
  • FWIW the reverse regex (simplified to assume only word characters in the string) is `/^(?:(.)(?!.*?\1))+\z/` – ysth Apr 12 '11 at 16:21
  • The quadratic time for the regex is worst case. Best case is constant time (but you have to have all the words in the list start with a pair of matched letters to get it). In general it will vary with the word list. The regex will be very good at words like `aardvark`, but bad at words like `bumblebee`. Are the words long or short (for some definition of short and long)? Are there many or few repeated characters? Agreed, benchmarking with a representative corpus is the only way to know. – daotoad Apr 12 '11 at 16:26
  • @daotoad, `bumblebee` has repeated `b`'s that the regex will find quickly. The actual worst case is the words that we're trying to find, with no repeated letters. The regex has to try every possible matching pair before giving up. – cjm Apr 12 '11 at 16:31
  • @daotoad: `bumblebee` is a really bad example for the worst case, as the regexp will finish on the very first `b` :-); the worst case for this regexp is a word with no duplicated letters. – anydot Apr 12 '11 at 16:32
  • I'd expect the regex to be much much faster than the dup-counting, for realistic word lengths. I'd change it to be non-greedy though. – ysth Apr 12 '11 at 16:50
  • I decided to take the regex and apply it with grep -v as shown by @tchrist below. – Amjith Apr 12 '11 at 18:49
  • Yep, bumblebee sucks pretty badly as a hard word that still yields a match. Finding a pathological example that matches is pretty tough. – daotoad Apr 12 '11 at 20:08
5
$ egrep -vi '(.).*\1' wordlist
tchrist
3

It is possible to use regex:

import re

inp = [
    'abducts'
,   'abe'
,   'abeam'
,   'abel'
,   'abele'
]

# match any word that contains some character at least twice
rgx = re.compile(r'.*(.).*\1.*') 

def filter_words(inp):
    for word in inp:
        if rgx.match(word) is None:
            yield word

print list(filter_words(inp))
Simon Bergot
  • OK, I admit it: that is a formally elegant solution, and as far as I can see it will work. Whether it will work in a sensible time, I'm not so sure. – Colin Fine Apr 12 '11 at 16:23
  • Of course. The solution from chrisaycock is the best one if you don't force yourself to use regular expressions. – Simon Bergot Apr 12 '11 at 16:27
3

Simple Stuff

Despite the inaccurate protestation that this is impossible with a regex, it certainly is.

While @cjm justly states that it is a lot easier to negate a positive match than it is to express a negative one as a single pattern, the model for doing so is sufficiently well-known that it becomes a mere matter of plugging things into that model. Given that:

    /X/

matches something, then the way to express the condition

    ! /X/

in a single, positively-matching pattern is to write it as

    /\A (?: (?! X ) . ) * \z /sx

Therefore, given that the positive pattern is

    / (\pL) .* \1 /sxi

the corresponding negative needs must be

    /\A (?: (?! (\pL) .* \1  ) . ) * \z /sxi

by way of simple substitution for X.
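
As a quick sanity check, running the question’s sample words through that substituted pattern (a minimal sketch) keeps exactly the ones with no repeated letter:

    use strict;
    use warnings;

    # prints abducts, abe, and abel; rejects abeam and abele
    for my $word (qw(abducts abe abeam abel abele)) {
        print "$word\n"
            if $word =~ / \A (?: (?! (\pL) .* \1 ) . ) * \z /sxi;
    }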

Real-World Concerns

That said, there are extenuating concerns that may sometimes require more work. For example, while \pL describes any code point having the General_Category=Letter property, it does not consider what to do with words like red‐violet–colored, ’Tisn’t, or fiancée — the latter of which has distinct but canonically equivalent NFD and NFC forms.

In a string like "r\x{E9}sume\x{301}", for example, the two spellings of “é” are canonically equivalent grapheme clusters, and a correct solution should detect them as a duplicated letter.

To account for such as these, you must at a bare minimum first run your string through an NFD decomposition, and then afterwards also use grapheme clusters via \X instead of arbitrary code points via ..

So for English, you would want something along these lines for the positive match, with the corresponding negative match formed per the substitution given above:

    NFD($string) =~ m{
        (?<ELEMENT>
           (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X 
        )
        \X *
        \k<ELEMENT>
    }xi
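
For instance (a sketch using the core Unicode::Normalize module, with a plain numbered capture standing in for the named group), the NFD step is what makes the composed and decomposed spellings of “é” register as the same duplicated element:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD);

    binmode STDOUT, ":encoding(UTF-8)";

    # one é composed (U+00E9), one decomposed (e + U+0301)
    my $string = "r\x{E9}sume\x{301}";

    if (NFD($string) =~ m{ ( (?= \p{Alphabetic} ) \X ) \X * \1 }xi) {
        print "duplicate element detected: <$1>\n";
    }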

But even with that there still remain certain outstanding issues unresolved, such as for example whether \N{EN DASH} and \N{HYPHEN} should be considered equivalent elements or different ones.

That’s because properly written, hyphenating two elements like red‐violet and colored to form the single compound word red‐violet–colored, where at least one of the pair already contains a hyphen, requires that one employ an EN DASH as the separator instead of a mere HYPHEN.

Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade‐off. People using typewriter‐English don’t even do that, though, using that super‐massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.

It just depends whether your text came from some 19th‐century manual typewriter — or whether it represents English text properly rendered under modern typesetting rules. :)

Conscientious Case Insensitivity

You will note I am here considering letters that differ in case alone to be the same letter. That’s because I use the /i regex switch, ᴀᴋᴀ the (?i) pattern modifier.

That’s rather like saying that they are the same as collation strength 1 — but not quite, because Perl uses only case folding (albeit full case folding not simple) for its case insensitive matches, not some higher collation strength than the tertiary level as might be preferred.

Full equivalence at the primary collation strength is a significantly stronger statement, but one that may well be needed to fully solve the problem in the general case. However, that requires a lot more work than the problem necessarily requires in many specific instances. In short, it is overkill for many specific cases that actually arise, no matter how much it might be needed for the hypothetical general case.

This is made even more difficult because, although you can for example do this:

    my $collator = Unicode::Collate::Locale->new(
                       level         => 1,
                       locale        => "de__phonebook",
                       normalization => undef,
                   );

    if ($collator->cmp("müß", "MUESS") == 0) { ... }

and expect to get the right answer — and you do, hurray! — this sort of robust string comparison is not easily extended to regex matches.

Yet. :)

Summary

The choice of whether to under‐engineer — or to over‐engineer — a solution will vary according to individual circumstances, which no one can decide for you.

I like CJM’s solution that negates a positive match, myself, although it’s somewhat cavalier about what it considers a duplicate letter. Notice:

    while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
        print "The letter <$2> is duplicated in the substring <$1>.\n";
    } 

produces:

    The letter <e> is duplicated in the substring <e__phone>.
    The letter <_> is duplicated in the substring <__>.
    The letter <o> is duplicated in the substring <onebo>.
    The letter <o> is duplicated in the substring <oo>.

That shows why when you need to match a letter, you should always use \pL ᴀᴋᴀ \p{Letter} instead of \w, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}].

Of course, when you need to match an alphabetic, you need to use \p{alpha} ᴀᴋᴀ \p{Alphabetic}, which isn’t at all the same as a mere letter — contrary to popular misunderstanding. :)
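
For a tiny illustration of the difference (the identifier below is just a made-up example), \w happily reports the repeated underscore as a “duplicate letter”, while \pL does not:

    my $id = "ab_cd_ef";                                # hypothetical input
    print "flagged by \\w\n"  if $id =~ /(\w).*\1/;     # fires: "_" repeats
    print "flagged by \\pL\n" if $id =~ /(\pL).*\1/;    # silent: no letter repeats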

tchrist
2

If you're dealing with long strings that are likely to have duplicate letters, stopping ASAP may help.

use feature qw( say );   # enable "say" (or: use 5.010)

INPUT: for (@input) {
   my %seen;
   while (/(.)/sg) {
      next INPUT if $seen{$1}++;
   }
   say;
}

I'd go with the simplest solution unless the performance is found to be really unacceptable.

my @output = grep !/(.).*?\1/s, @input;
ikegami
  • One might wish to apply `/i` to effect a case-insensitive match. Depends on whether *Roger* is held to contain a duplicate letter. I tend to think it does, but then, I feel the same about *René*, which is somewhat more challenging. – tchrist Apr 12 '11 at 17:46
2

I was very curious about the relative speed of the various Perl-based methods submitted by other authors for this question. So, I decided to benchmark them.

Where necessary, I slightly modified each method so that it would populate an @output array, to keep the input and output consistent. I verified that all the methods produce the same @output, although I have not documented that assertion here.

Here is the script to benchmark the various methods:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(cmpthese :hireswallclock);

# get a convenient list of words (on Mac OS X 10.6.6, this contains 234,936 entries)
open (my $fh, '<', '/usr/share/dict/words') or die "can't open words file: $!\n";
my @input = <$fh>;
close $fh;

# remove line breaks
chomp @input;

# set up the tests
my %tests = (

  # Author: cjm
  RegExp => sub { my @output = grep { not /(\w).*\1/ } @input },

  # Author: daotoad
  SplitCount => sub { my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input; },

  # Author: ikegami
  NextIfSeen => sub {
    my @output;
    INPUT: for (@input) {
      my %seen;
      while (/(.)/sg) {
        next INPUT if $seen{$1}++;
      }
      push @output, $_;
    }

  },

  # Author: ysth
  BitMask => sub {
    my @output;
    for my $word (@input) {
      my $mask1 = $word x ( length($word) - 1 );
      my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
      if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        push @output, $word;
      }
    }
  },

);

# run each test 100 times
cmpthese(100, \%tests);

Here are the results for 100 iterations.

           s/iter SplitCount    BitMask NextIfSeen     RegExp
SplitCount   2.85         --       -11%       -58%       -85%
BitMask      2.54        12%         --       -53%       -83%
NextIfSeen   1.20       138%       113%         --       -64%
RegExp      0.427       567%       496%       180%         --

As you can see, cjm's "RegExp" method is the fastest by far. It is 180% faster than the next fastest method, ikegami's "NextIfSeen" method. I suspect that the relative speed of the RegExp and NextIfSeen methods will converge as the average length of the input strings increases. But for "normal" length English words, the RegExp method is the fastest.
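
If you wanted to test that suspicion, one rough way (a sketch only, not benchmarked here) would be to swap @input for synthetic longer strings and rerun the same cmpthese() call:

# build 10,000 random 40-character "words"; nearly all will contain a repeat,
# which mostly exercises the rejection path of each method
my @letters = ('a' .. 'z');
my @long_input;
for (1 .. 10_000) {
    my $word = '';
    $word .= $letters[ int rand @letters ] for 1 .. 40;
    push @long_input, $word;
}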

Sam Choukri
  • I can report that Perl 5.14 runs those (generally) faster than 5.12 does. Using 5.14’s RC0 compared with 5.12.3, SplitCount runs in 93% of the time, BitMask in 72% of the time, NextIfSeen in 106% of the time, and RegExp in 98% of the time. – tchrist Apr 12 '11 at 21:22
1

cjm gave the regex, but here's an interesting non-regex way:

@words = qw/abducts abe abeam abel abele/;
for my $word (@words) {
    # $mask1 is the word repeated length-1 times; $mask2 is every rotation
    # of the word (by 1 .. length-1) concatenated in the same order.
    my $mask1 = $word x ( length($word) - 1 );
    my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
    # XOR gives a NUL byte wherever a character equals the character at some
    # other position, so "no NUL bytes" means no letter repeats.
    if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        print "$word\n";
    }
}
ysth
1

In response to cjm's solution, I wondered about how it compared to some rather terse Perl:

my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;

Since I am not constrained in character count and formatting here, I'll be a bit clearer, even to the point of over-documenting:

my @output = grep {

    # Split $_ on the empty string to get letters in $_. 
    my @letters = split '';

    # Use a hash to remove duplicate letters.
    my %unique_letters;
    @unique_letters{@letters} = ();  # This is a hash slice assignment.
                                     # See perldoc perldata for more info.

    # is the number of unique letters equal to the number of letters?
    keys %unique_letters == @letters

} @input;

And, of course in production code, please do something like this:

my @output = grep ! has_repeated_letters($_), @input;

sub has_repeated_letters {
    my $word = shift;
    my @letters = split '', $word;
    my %unique_letters;
    @unique_letters{@letters} = ();
    return keys %unique_letters != @letters;   # true if any letter repeats
}
daotoad
  • A regex will always beat the pants off that solution. If you find yourself splitting a string into individual characters, you’re almost always going to pay a heavy price. – tchrist Apr 12 '11 at 17:35
1

In python with a regex:

python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt

In python without a regex:

python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt

I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including output in the timing:

Timings without the regex:

python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop

Timings with the regex:

python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if not re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop

Clearly the regex is a tiny bit slower than a simple frozenset creation and len comparison in python.

freegnu
-1

You can't do this with Regex. Regex is a Finite State Machine, and this would require a stack to store what letters have been seen.

I would suggest doing this with a foreach, manually checking each word in code. Something like this pseudocode:

List chars
foreach word in list
    foreach letter in word
        if chars.contains letter then remove word from list
        else
            chars.Add letter
    chars.clear
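
A direct Perl translation of that pseudocode (a sketch, assuming the words are already in @words; it is essentially the same loop as in ikegami's answer above) would be:

my @filtered;
WORD: for my $word (@words) {
    my %chars;
    for my $letter (split //, $word) {
        next WORD if $chars{$letter}++;   # a repeated letter drops the word
    }
    push @filtered, $word;                # keep only words with no repeats
}

Building a filtered copy avoids modifying the list while iterating over it, which the pseudocode above glosses over.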
joe_coolish
  • I don't think Solid State means what you think it means: http://en.wikipedia.org/wiki/Solid_state_(electronics) – daotoad Apr 12 '11 at 15:55
  • Regular expressions (nominally) accept regular grammars, which means that they are equivalent to a species of state machine called a "finite state automaton". Extended regex rules allow more powerful matching abilities that let one climb the Chomsky Hierarchy even to its very peak, to the amazing land of the Recursively Enumerable languages that are the sole domain of Turing Complete automata. – daotoad Apr 12 '11 at 16:00
  • @daotoad: Sorry, I meant Finite State Machine. It's early. Thanks for the correction. But regardless, you can't do it. If you down-voted me on that mistake, I have fixed it and I would greatly appreciate being set back to 0. The rest of my comment is valid and I even produced pseudocode for how to achieve what he is looking for. All of which is clear and on topic, with additional information beyond the basic "you can't do it". I don't feel that the extra effort I put into the post should be neglected on a technicality. – joe_coolish Apr 12 '11 at 16:06
  • Nobody (*pax* the [RE2 library](http://search.cpan.org/~dgl/re-engine-RE2-0.07/lib/re/engine/RE2.pm)) uses formal ʀᴇɢᴜʟᴀʀ ᴇxᴘʀᴇssɪᴏɴs in modern pattern matching. Indeed, even lowly POSIX regexes require more than that, since they have to be able to match `(.*)\1`. Sounds like too much time in the classroom and not enough in the real world. – tchrist Apr 12 '11 at 17:38