4

I'm trying to write a regular expressions that will match a set of characters without regard to order. For example:

str = "act" 
str.scan(/Insert expression here/)

would match:

cat
act
tca
atc
tac
cta

but would not match ca, ac or cata.

I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.

To clarify a bit, I'm using ruby and do not want to allow repeat characters.

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
Mutuelinvestor
  • 3,384
  • 10
  • 44
  • 75
  • May I clarify that you want to find cat, act in a string "cat att act, ccc". Or you want to check that "cat", "act" are permutation of act? – nhahtdh Jan 25 '13 at 03:39
  • Just note that regex is not a solution if you are trying to find substring that are permutation of a certain set of characters. – nhahtdh Jan 25 '13 at 03:43
  • I want the regular expression to match any permutation of act with repeating characters and a requirement that all characters must be utilized. – Mutuelinvestor Jan 25 '13 at 03:45
  • So you want to find "cat", "act" in "cat att act, ccc", correct? – nhahtdh Jan 25 '13 at 03:48
  • not att or ccc, as they have repeats. I just tested the solution proposed by Carl and refined by you: ^(?=[act]{3}$)(?!.*(.).*\1) and it works great. Now I just need to make sure I understand it. I haven't had a lot of experience with positive and negative look aheads but they seem pretty powerful. Thanks again! – Mutuelinvestor Jan 25 '13 at 03:55
  • possible duplicate of [Regex to match 1234, 1324, 2341 (all permutations of {1,2,3,4})](http://stackoverflow.com/questions/3101366/regex-to-match-1234-1324-2341-all-permutations-of-1-2-3-4) – Sjoerd Jan 25 '13 at 14:45

5 Answers5

5

Here is your solution

^(?:([act])(?!.*\1)){3}$

See it here on Regexr

^                  # matches the start of the string
    (?:            # open a non capturing group 
        ([act])    # The characters that are allowed and a capturing group
        (?!.*\1)   # That character is matched only if it does not occur once more, Lookahead assertion
    ){3}           # Defines the amount of characters
$

The only special think is the lookahead assertion, to ensure the character is not repeated.

^ and $ are anchors to match the start and the end of the string.

stema
  • 90,351
  • 20
  • 107
  • 135
3

[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.

Edit: as mentioned by @georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:

^(?=[act]{3}$)(?!.*(.).*\1).*$

That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.

Community
  • 1
  • 1
Carl Norum
  • 219,201
  • 40
  • 422
  • 469
  • however, this would also match any character repetitions, such as aaa, aac, att, right? – georgedyer Jan 25 '13 at 00:56
  • 2
    Yes, it would. Rereading the question leaves me unclear as to whether or not that's allowed. If it's not, the question is a duplicate of http://stackoverflow.com/questions/3101366/regex-to-match-1234-1324-2341-all-permutations-of-1-2-3-4, and the accepted answer there should work fine. – Carl Norum Jan 25 '13 at 00:57
  • 1
    The solution over there only works for permutation of a set of characters, and will not work for multiset. Another thing is that this much is `^(?=[act]{3}$)(?!.*(.).*\1)` sufficient. – nhahtdh Jan 25 '13 at 03:15
  • What's a multiset? My first example supports repeated characters. – Carl Norum Jan 25 '13 at 03:17
  • @CarlNorum: A set that allows repetition of character (but you can only pick exactly the number of characters specified in the set, e.g. set of `abss` --> `bass` is OK, but `abba` is not OK). Another thing is that I doubt the expression above would work with `scan`, if I understand the function of `scan` correctly. – nhahtdh Jan 25 '13 at 03:19
2

Here's how I'd go about it:

regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /(?:act|atc|cat|cta|tac|tca)/

%w[
  cat act tca atc tac cta
  ca ac cata
].each do |w|
  puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end

That outputs:

"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match

I use the technique of passing an array into Regexp.union for a lot of things; I works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:

'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"

Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:

puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)

puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b

Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.

If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:

use List::Permutor;
use Regexp::Assemble;

my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my @set = $perm->next) {
    $regex_assembler->add(join('', @set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))

See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.

Community
  • 1
  • 1
the Tin Man
  • 158,662
  • 42
  • 215
  • 303
  • As a general solution, to find all matches, I'd rather use a sliding window and loop rather than regex. – nhahtdh Jan 25 '13 at 06:23
  • That's a nice idea, but a single regex will greatly out-perform the loop, and, if done correctly, can tell you which value the regex matched on. – the Tin Man Jan 25 '13 at 06:26
  • The sliding window can also tell the start and end position of the substring, so functionality wise, both should be the same. (Keeping the start and end position is most likely what is done in the regex engine). Performance is a bit questionable. – nhahtdh Jan 25 '13 at 06:29
1

I will assume several things here: - You are looking for permutations of given characters - You are using ruby

str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}

# and for the actual test
permutations.include?("cat")

It is no regex though.

scones
  • 3,317
  • 23
  • 34
1

No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like @scones suggested.

"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]

And if you really need a regex out of it for scanning a larger string, you can always:

Regexp.union "act".split('').permutation.map(&:join)
=> /\b(act|atc|cat|cta|tac|tca)\b/

Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.

EDIT: Added word boundaries for false positive on cata based on @theTinMan's feedback.

Cade
  • 3,151
  • 18
  • 22
  • @theTinMan That's the copy/pasted output from irb. Can you be less cryptic about where you think there's a problem? – Cade Jan 25 '13 at 05:49
  • Test your regex against the OP's list of hit and miss words. – the Tin Man Jan 25 '13 at 06:07
  • @theTinMan Ah. `cata`. Thanks for pointing that out. I also didn't know about the potential embedding problems that could arise by not using `source`. Thanks for your detailed answer. I've +1'd it. – Cade Jan 25 '13 at 13:09
  • No problem. It's important to test, both here and the workplace. Regular expressions can be treacherous and leaks can occur. – the Tin Man Jan 25 '13 at 13:53
  • Just thought that I'd let you know that despite the regular expression working fine in Rubular, I couldn't get it to match in my actual application. As a result, I ended up using your solution in my application. Many thanks! – Mutuelinvestor Jan 27 '13 at 15:39