0

I am attempting to expand on this regex for listing all possible anagrams for a given set of letters:

^(?!.*([aer]).*\1)(?!(.*d){4})([aerd]*|[a-z])$

so far based on this regex, I can receive a match on any combination of words and sub-words made up of the letters 'dadder', such as 'adder', 'add', 'ad', 'red' etc. The reason for the regex complexity instead of a simple [dadder]* is because obviously each letter can be matched an infinite amount of times, which is bad, I want each letter to match the test string only once, if two d's are provided, it can match up to two times only or less. If somebody of course could streamline the regex to match any combinations of letters exactly X times specified, please feel free to provide it :)

However my main question, I would now like to incorporate a full stop character ".". If a full stop is ever encountered in the list of characters, it acts as a wildcard and could match any character a-z. So dadd.r could match daddzr, daddor, daddpr, rpdadd etc.

Could anybody help me with this?

GONeale
  • 26,302
  • 21
  • 106
  • 149
  • possible duplicate of [Check if string is subset of a bunch of characters? (RegEx)?](http://stackoverflow.com/questions/14383119/check-if-string-is-subset-of-a-bunch-of-characters-regex). There are 2 types of solutions: form regex out of the words to be tested, or form a single regex to match all the words. Disclosure: I am the author of the unpopular and crazy "form a single regex". – nhahtdh Mar 01 '13 at 11:21
  • Thanks for the link, will check it out nhahtdh – GONeale Mar 01 '13 at 11:26
  • Please follow the instruction in the post unless you know what you are doing... – nhahtdh Mar 01 '13 at 11:33
  • I did dude, I understand you generate it. I've read your post. I'm just trying to get this working in Regex Buddy first. I think this should work with my or condition at the end, but it's not :( any ideas on how to support my wildcard? `^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^\w]*+\w){1})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})([abcdefg]+|\w)$` – GONeale Mar 01 '13 at 11:34
  • `\w` I feel could have `\w{2}`, `\w{3}` etc. for as many wildcards as full stops are found. – GONeale Mar 01 '13 at 11:35

3 Answers3

1

This is not a problem that should be solved with a regex, as nhahtdh's amusing answer should convince you.

Regexes are good at matching patterns. They are not a tool for solving set-based problems, which is what you are trying to use them for.

You really need an algorithmic approach, because that is the nature of the problem. This question covers just such a topic.

Community
  • 1
  • 1
  • @nhahtdh, my assumption was that the matching *was* for generating anagrams. i.e. matching dictionary words that are part of potential anagrams. The OP stated this was for an "anagram creator". –  Mar 01 '13 at 12:28
  • Eek, let me clarify this. This is to *find* all anagrams. I called it an anagram 'creator' because my tool will *create* a list of all possible anagrams and list them for the user. – GONeale Mar 01 '13 at 12:30
  • The user can enter any amount of random letters. I am iterating 200K+ dictionary words, I wish to test if any words can be made using any or all of the user's supplied letters. – GONeale Mar 01 '13 at 12:37
  • @GONeale: The algorithm linked can be thought of as finding anagrams in a dictionary (which is your case). So on second thought, I think it may be applicable. I don't know how you would modify it, though. – nhahtdh Mar 01 '13 at 12:42
  • Well I honestly like your regex :) and I'm certain I can alter it to support wildcards. Could you try and fix what I have so far if you have time? `^(?!(?:[^a]*+a){3})(?!(?:[^b]*+b){2})(?!(?:[^c]*+c){2})(?!(?:[^d]*+d){2})(?!(?:[^e]*+e){2})(?!(?:[^f]*+f){2})(?!(?:[^g]*+g){2})([abcdefg]+|\w{1})$` This will match 'abcdefg' but not 'abcdefgz' yet even though I have the or `\w{1}`. – GONeale Mar 01 '13 at 12:44
1

The first part of the question is a duplicate of this question: Check if string is subset of a bunch of characters? (RegEx)?


This answer is dedicated to tackle the actual problem you are facing (the second part of the question).

A very simple solution would be using 2 maps: one to map the frequencies of the characters in the original set, and takes note of the number of ., the other to map the frequencies of the characters for each input string.

Pseudocode:

// I assume the maps return 0 for non existent entries
// Depending on the input, the map can simply be an array, or a tree/hash map

function checkAnagramExtended(originalString, inputString):
    if (inputString.length > originalString.length):
        return false

    // The frequency mapping for original string (ref stands for reference)
    // Ideally, refMap should be filled up once instead of every call
    // to this function
    var refMap = countFrequency(originalString)
    // The frequency mapping for input string
    var inpMap = empty map

    foreach (character c in inputString):

        if (inpMap[c] >= refMap[c]):
            // You may want to check that c is a character allowed
            // to be substituted by dot .
            // if (!canBeSubstitutedByDot(c)):
            //     return false

            if (inpMap['.'] >= refMap['.']):
                return false
            else:
                inpMap['.'] += 1

        else:
            inpMap[c] += 1

    return true

Appendix: Extending regex solution?

Your dot . extension, which allow any character from a-z to be matched makes the regex solution becomes even more impractical.

In my solution for the other problem, I relied heavily on the negative look-ahead to assert the count of a particular character is less than the maximum number of characters in the multiset of characters.

The dot . extension can vary the maximum number of characters allowed for any of the characters, thus breaks my solution above. If you force regex to do the job, it is possible to generate regex if there is only 1 ., but things explodes when you increase it to 2.

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • Thanks for your help nhahtdh. I'm in the middle of looking at FogleBird's answer suggested by dan1111. – GONeale Mar 01 '13 at 12:19
  • @GONeale: Wait. Do you want to match or generate? – nhahtdh Mar 01 '13 at 12:23
  • Match. I already have my word list. That's why I believe Regex can do this job. I mean, it is already, and working brilliantly on 200K+ words, I just want to expand it to support wildcards. – GONeale Mar 01 '13 at 12:25
  • (I should say, I obviously build up my Regex in my C# program, depending on the letters typed, it doesn't always look like that specific 'dadder' example. – GONeale Mar 01 '13 at 12:26
  • I have no interest in generating, I already have a dictionary of words, I just want a TRUE or FALSE if when iterating my list of dictionary words any word can be made from the supplied letters. – GONeale Mar 01 '13 at 12:32
  • @GONeale: On second though, the answer by dan1111 is applicable for the first part. But I am not sure whether it is extensible to your actual problem (I have a feeling that it is possible, though). – nhahtdh Mar 01 '13 at 12:37
  • Please see my comment on dan's answer. – GONeale Mar 01 '13 at 12:38
  • @GONeale: Does the anagram has to be the same length as the input e.g. input = `asd`, anagram = `dsa`, `ad`, `a`, etc. or is `ad` and `a` disallowed? Without this assumption, I suggest you use the algorithm above or simply sort + loop. – nhahtdh Mar 01 '13 at 12:53
  • I reckon I've nailed it, will post my regex soon. – GONeale Mar 01 '13 at 13:08
  • I got this worked out beautifully with a Regex nhahtdh, most of my time spent in the last hour was just spent building it appropriately in code. Do you want to see it? – GONeale Mar 01 '13 at 13:58
  • @GONeale: Feel free to post it as an answer. – nhahtdh Mar 01 '13 at 13:59
  • Thanks, all done, you can check it out and thanks again for your help. I knew it was possible with Regex! :) (and in my mind, still the most elegant tool for the job in this scenario) – GONeale Mar 01 '13 at 14:28
0

Ok, after much toiling around attempting to get this going as a Regex, I gave in due to incomplete wildcard support and slow processing times.

I've now converted my requirements to a C# function and I'm actually a LOT more comfortable and happier now because it's also about 400% quicker which is great.

This will check if the given word is an anagram or sub-anagram of a set of letters with wildcard support via the (.).

Where letters is the letters to test against for anagrams.

Where dictionaryData is a List<string> of words to test for.

var letterCounts = letters.Select(x => x)
  .GroupBy(x => x)
  .ToDictionary(x => x.Key, x => x.Count());

var containsWildcards = letters.IndexOf('.') >= 0;
foreach (var dictWord in dictionaryData)
{
    var matches = 0;
    var dictWordLength = dictWord.Length;
    if (dictWordLength > letters.Length)
        continue;
    var addedChars = new List<char>();
    foreach (var dictLetter in dictWord)
    {
        var foundLetter = false;
        if (letterCounts.ContainsKey(dictLetter) &&
            addedChars.Count(x => x == dictLetter) < letterCounts[dictLetter])
        {
            if (letters.IndexOf(dictLetter) >= 0)
                foundLetter = true;
        }
        else if (containsWildcards &&
            addedChars.Count(x => x == '.') < letterCounts['.'])
        {
            addedChars.Add('.');
            foundLetter = true;
        }
        if (foundLetter)
        {
            addedChars.Add(dictLetter);
            matches++;
        }
        if (dictWordLength == matches)
            break;
    }

    if (dictWordLength <= matches)
    {
        // We have a match!
    }
}

Hope it can help someone else too.

GONeale
  • 26,302
  • 21
  • 106
  • 149
  • You said that `.` is wildcard for any character, but here in your regex, `.` cannot be one of the characters that is already in the set. Your regex will exclude the cases of `deer` or `rare`, for `A.DDER`. (Note that `deer` has extra e, which can use the dot, and `rare` has extra r, which can use the dot). http://www.regex101.com/r/rS8hU5. (You can play around with the input, the regex is copy pasted from your post). – nhahtdh Mar 01 '13 at 15:57
  • Oh, that is a problem, good find :) Yeah a wildcard needs to obviously be one of your existing letters too :( – GONeale Mar 02 '13 at 00:05
  • Ok, got this going, but gave up on Regex, had to create a C# function, but I'm actually much happier now I finally gave in on regex, because it's about 400% quicker which is great. – GONeale Mar 02 '13 at 05:22
  • See what you think @nhahtdh, feel free to do some testing, but checked a stack of words and everything is checking out. Thanks again for your persistence with me in this question :) – GONeale Mar 02 '13 at 05:33
  • `dictWordLength == matches` will happen when you have looped through the whole word. So it is useless to `break`. The fact that `letterCounts[dictLetter]` larger than something already implies it is at least 1, so the extra check inside is useless. I am going to post more of my comment in pastebin... – nhahtdh Mar 02 '13 at 06:32
  • Yeah that could have been left over from an initial version I was writing, probably not needed. Unknown Paste ID when I click your link. – GONeale Mar 05 '13 at 05:47
  • I set it to expire after 1 day, I think? I found that your code does quite a lot of redundant things. And you may also want to check that the character is in the allowed set for wildcard case (think about the case of `$` in the string, `$` is not in the set `a-z`). – nhahtdh Mar 05 '13 at 06:32