Check if string is subset of a bunch of characters? (RegEx)?

Question

I have a little problem, I have 8 characters, for example "a b c d a e f g", and a list of words, for example: mom, dad, bad, fag, abac

How can I check if I can or cannot compose these words with the letters I have? In my example, I can compose bad, abac and fag, but I cannot compose dad (I have not two D) and mom (I have not M or O).

I'm pretty sure it can be done using a RegEx but would be helpful even using some functions in Perl.. Thanks in advance guys! :)

It is possible to force regex to do the job, but it is better to just do counting. — nhahtdh, Jan 17 '13 at 16:06
I'm going crazy with this from hours, you mean is better in terms of performance or simplicity? — Federico Fallico, Jan 17 '13 at 16:09
I think simple logic can do the job here. You are only dealing with English alphabet, right? — nhahtdh, Jan 17 '13 at 16:10
yes, but i need to process over 500.000 words with a single string made of 10 characters, to check if I can create every single word with the 10 characters, and for each word, if true, I have to print it. — Federico Fallico, Jan 17 '13 at 16:13

score 6 · Accepted Answer · answered Jan 17 '13 at 16:59

This is done most simply by forming a regular expression from the word that is to be tested.

This sorts the list of available characters and forms a string by concatenating them. Then each candidate word is split into characters, sorted, and rejoined with the regex term .* as separator. So, for instance, abac will be converted to a.*a.*b.*c.

Then the validity of the word is determined by testing the string of available characters against the derived regex.

use strict;
use warnings;

my @chars = qw/ a b c d a e f g /;
my $chars = join '', sort @chars;

for my $word (qw/ mom dad bad fag abac /) {
  my $re = join '.*', sort $word =~ /./g;
  print "$word is ", $chars =~ /$re/ ? 'valid' : 'NOT valid', "\n";
}

output

mom is NOT valid
dad is NOT valid
bad is valid
fag is valid
abac is valid

Sure okay, I don't know if perl handles that well or not, so whatever you think :) — GONeale, Mar 05 '13 at 05:49

nhahtdh · Answer 2 · 2013-02-25T21:12:15.760

This is to demonstrate the possibility rather than endorsing the regex method. Please consider other saner solution.

First step, you need to count the number of characters available.

Then construct your regex as such (this is not Perl code!):

Start with start of input anchor, this matches the start of the string (a single word from the list):

Append as many of these as the number of unique characters:

(?!(?:[^<char>]*+<char>){<count + 1>})

Example: (?!(?:[^a]*+a){3}) if the number of a is 2.

I used an advanced regex construct here called zero-width negative look-ahead (?!pattern). It will not consume text, and it will try its best to check that nothing ahead in the string matches the pattern specified (?:[^a]*+a){3}. Basically, the idea is that I check that I cannot find 3 'a' ahead in the string. If I really can't find 3 instances of 'a', it means that the string can only contain 2 or less 'a'.

Note that I use *+, which is 0 or more quantifier, possessively. This is to avoid unnecessary backtracking.

Put the characters that can appear within []:

[<unique_chars_in_list>]+

Example: For a b c d a e f g, this will become [abcdefg]+. This part will actually consume the string, and make sure the string only contains characters in the list.

End with end of input anchor, which matches the end of the string:

So for your example, the regex will be:

^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})[abcdefg]+$

You must also specify i flag for case-insensitive matching.

Note that this only consider the case of English alphabet (a-z) in the list of words to match. Space and hyphen are not (yet) considered here.

oh god, crazy thing! you're a monster! :D going to try it in a moment! — Federico Fallico, Jan 17 '13 at 16:26
@FedericoFallico: It seems that I'm spreading bad coding practice by posting things that shows possibility... — nhahtdh, Jan 17 '13 at 16:30

score 1 · Answer 3 · answered Jan 17 '13 at 16:54

1

How about sorting both strings into alphabetical order then for the string you want to check insert .* between each letter like so:

'aabcdefg' =~ m/a.*b.*d.*/
True
'aabcdefg' =~ m/m.*m.*u.*/
False
'aabcdefg' =~ m/a.*d.*d.*/
False

answered Jan 17 '13 at 16:54

Dave Sexton

10,768
3
42
56

ryanm · Answer 4 · 2013-01-17T16:31:57.423

Some pseudocode:

Sort the available characters into alphabetical order
for each word:
- Sort the characters of the word into alphabetical order
  - For each character of the word search forwards through the available characters to find a matching character. Note the this search will never go back to the start of the available chars, matched chars are consumed.

Or even better, use frequency counts of characters. For your available characters, construct a map from character to occurence count of that character. Do the same for each candidate word and compare against the available map, if the word map contains a mapping for a character where the available map does not, or the mapped value is larger in the word map than the available map, then the word cannot be constructed using the available characters.

Joel Berger · Answer 5 · 2013-01-17T23:16:15.470

Here's a really simple script that would be rather easy to generalize:

#!/usr/bin/env perl

use strict;
use warnings;

sub check_word {
  my $word = shift;
  my %chars;
  $chars{$_}++ for @_;
  $chars{$_}-- or return for split //, $word;
  return 1;
}

print check_word( 'cab', qw/a b c/ ) ? "Good" : "Bad";

~~And of course the performance of this function could be greatly enhanced if the letters list is going to be the same every time.~~ Actually for eight characters, copying the hash vs building a new one each time is probably the same speed.

score -2 · Answer 6 · answered Jan 17 '13 at 16:12

-2

pseudocode:

bool possible=true
string[] chars= { "a", "b", "c"}   
foreach word in words
{
     foreach char in word.chars
     {
          possible=possible && chars.contains(char)
     }
}

answered Jan 17 '13 at 16:12

Allie

1,081
1
13
17

You didn't take the count into consideration. – nhahtdh Jan 17 '13 at 16:13
exactly, double letters are not checked.. this is ultra-easy ^^' – Federico Fallico Jan 17 '13 at 16:17

Check if string is subset of a bunch of characters? (RegEx)?

6 Answers6

This is to demonstrate the possibility rather than endorsing the regex method. Please consider other saner solution.

Linked