3

We have an application that generates random base-35 numbers [0-9A-Z], excluding the letter O. I am looking for a way to detect codes that contain any obscene English words without searching through a list of 10,000 entries for each generated code. Tens of thousands of codes are generated per second, so scanning these huge obscenity lists for every code would either crush our servers or require a lot more instances to offset the performance hit.

Due to the nature of this code generator, obscenity checks need to be efficient and high performance.

Please note that omitting vowels is not an option, as base 35 is required, and mixed casing is not an option either. The problem here is more than just an algorithm; efficient string matching and searching is only a small part of it.

A portion of this would entail taking a list of strings, finding the commonly duplicated substrings of a given length (3, for instance), and omitting every word in the list that contains one of those common substrings, to generate an optimized list. This would help in abbreviating the lengthy obscenity filter lists in the wild to fit the need here.
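A minimal sketch of one simple variant of that pruning pass: instead of counting common fixed-length substrings, treat every shorter entry as a root and drop any longer word that contains it. ListOptimizer and Optimize are illustrative names, not an existing API:

using System.Collections.Generic;
using System.Linq;

//Illustrative sketch: keep only the shortest "root" entries and drop
//every word that contains an already-kept entry as a substring.
public static class ListOptimizer
{
    public static List<string> Optimize(IEnumerable<string> words)
    {
        //Shortest first, so a root is kept before any of its supersets.
        var sorted = words
            .Select(w => w.ToUpperInvariant())
            .Distinct()
            .OrderBy(w => w.Length)
            .ToList();

        var kept = new List<string>();
        foreach (var word in sorted)
        {
            //Drop any word that already contains a kept (shorter) root.
            if (!kept.Any(root => word.Contains(root)))
                kept.Add(word);
        }
        return kept;
    }
}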

VulgarBinary
  • Do you also hand out thousands of codes per second? If not: perform the validation when you hand them out. – Jeroen Vannevel May 01 '15 at 00:13
  • You are aware that obscenity filters are pretty much impossible to write with any kind of complete coverage, right? – BradleyDotNET May 01 '15 at 00:14
  • @JeroenVannevel - We hand them out at that speed; they are generated on demand. And I know, ironic, right? I guess we get no choice in what our jobs ask of us ;-) – VulgarBinary May 01 '15 at 00:16
  • @BradleyDotNET - I'm not trying to thwart creative gamers here, just a computer making codes. – VulgarBinary May 01 '15 at 00:16
  • I am not sure what "filter out" means. Are you saying that the algorithm that generates the codes should never emit a code that contains an obscene character sequence (which is not truly random but is random with constraints)? Or does the filtering happen later, e.g. substituting "*" for obscene words when displaying the code somehow? The question is very vague. – John Wu May 01 '15 at 00:18
  • Is this a constant stream of generation requests? Is any user input used to generate this? If not on both questions: you could leverage less intensive periods to generate codes and once again perform validation when handing them out. – Jeroen Vannevel May 01 '15 at 00:19
  • @JohnWu - Think of it like ShortCode.Generate() and inside generate if the code is obscene it would generate a new code. So essentially a bool IsVulgar(string code) { ... answer here ... } – VulgarBinary May 01 '15 at 00:20
  • @JeroenVannevel - I had considered that, but I'd like to find a solution that would work at run time without having to stash batches of pre-filtered codes. – VulgarBinary May 01 '15 at 00:21
  • possible duplicate of [Algorithm for multiple word matching in text](http://stackoverflow.com/questions/1099985/algorithm-for-multiple-word-matching-in-text) – Tony Lee May 01 '15 at 00:23
  • Well, it's pretty hard with such requirements. Are you excluding all languages or just the 20 most commonly used ones (for example)? If every language has 50 swear words then that's in the end only a lookup into a hashtable of 1000 elements. Might be worth benchmarking this to see if it really is a dealbreaker. – Jeroen Vannevel May 01 '15 at 00:24
  • @JeroenVannevel - Only English; I'll modify the question. – VulgarBinary May 01 '15 at 00:26
  • Maybe have a background process that pre-generates these codes and inserts them into the database. You could have multiple processes so that it keeps up (provided you can make them unique between processes). As they are generated, check for substrings. Even if it's half as fast, using 2 processes will keep up. You could even get fancy and only pre-compute so many at a time to avoid over-inserting or getting too far ahead. – Ron Beyer May 01 '15 at 00:33
  • @JeroenVannevel - Thanks for the suggestion. Part of what I'm hoping someone will give me is that optimized list. However, I'm apparently not permitted to ask for just the list; asking for the list before got me flamed and forced me to re-ask in a way that makes people think I only want an algorithm rather than a solution. – VulgarBinary May 01 '15 at 00:37
  • I can see why you don't want to eliminate all vowels. Have you considered eliminating sequences of characters instead? I.e. never allow more than two alphabetical characters in a row. I think the shortest English curse word is three characters. – John Wu May 01 '15 at 00:38
  • Yeah, requesting an off-site resource is in fact off-topic. You can find the most common ones pretty easily though: a quick search led me to http://www.noswearing.com/dictionary. – Jeroen Vannevel May 01 '15 at 00:39
  • @JohnWu - A solution like that is too limiting and reduces the footprint of available codes too significantly. We have length restrictions as well, and collisions between codes have to be limited. – VulgarBinary May 01 '15 at 00:40
  • @JeroenVannevel - Yeah, I've stumbled across many; maybe I should ask how to filter that down to an optimized list. All in all this comes down to me being lazy. I could write something to find the most repeated substrings in a CSV of words and omit the duplicates containing those substrings, but I thought this would be a quick answer off Stack... I was apparently greatly mistaken. Now I remember why I left Stack a year ago and stopped answering or asking questions... too many BS rules preventing good content. – VulgarBinary May 01 '15 at 00:42

2 Answers

3

Considering that the traditional chain of Contains calls doesn't perform well enough in your scenario, a Trie data structure might be a good start. Trie traversal is quite fast: O(m), where m is the length of the search string. In other words, it's almost constant time if the Trie is well structured.

Here is a minimal example in C# (a sketch; class and member names are illustrative):
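
using System.Collections.Generic;

//A minimal substring-matching Trie sketch; all names are illustrative.
public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWordEnd;
}

public class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children.Add(c, child);
            }
            node = child;
        }
        node.IsWordEnd = true;
    }

    //True if any stored word occurs anywhere inside the text:
    //restart the walk from every position of the input.
    public bool MatchesAnywhere(string text)
    {
        for (var start = 0; start < text.Length; start++)
        {
            var node = _root;
            for (var i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out node))
                    break;
                if (node.IsWordEnd)
                    return true;
            }
        }
        return false;
    }
}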

yazanpro
  • Wouldn't this be a colossal memory hog? Is this essentially a tree for each character to determine legitimate words? Just trying to wrap my head around this solution. So, from the root, A-Z on the 1st generation, then its children would determine any valid obscene second characters, and the leaves of the tree would be the complete vulgar words. Then each string would be checked against the tree character by character? – VulgarBinary May 01 '15 at 00:54
  • 2
    I've used `Tries` before and the performance gain against a lookup in a hashtable is massive. They also take up less memory than a list of the same strings because they don't duplicate characters. Your situation may be a little different though, since you want to find matches anywhere inside a search string, not just the beginning. The main benefit is that if structured correctly they can tell you when to /stop/ searching because there is no possible word that begins with the same stem. – Bradley Uffner May 01 '15 at 01:12
  • I think I could apply this logic (it's new to me, but I think I'm beginning to understand it after a bit of reading), but I still have to find an optimized list... However, building out the trie may lend itself to omitting the duplicates as well. I need to tinker with it, but this was a fairly eye-opening solution that I had not considered. – VulgarBinary May 01 '15 at 01:20
1

Building off of @yazanpro's trie suggestion, it was possible to build a solution that not only searches for the words but also stops additional searches early: once ABC is marked obscene, there is no need to also check for ABCFACE, ABCHOLE, ABCHEAD...

The tree-hydration logic currently also reformats a non-optimized word list into an optimized one, ten words to a line. Copy the output and paste it back in as the new WORDS list.

The value stored in each tree node. This could be optimized to not require storing the word; however, storing the word makes debugging easier. To drop it, change IsObscene to a settable property instead of deriving it from Word:

public class ObsceneValue
{
    public bool IsObscene
    {
        get { return Word != null; }
    }

    public string Word { get; set; }
    public char Character { get; set; }

    public ObsceneValue(char character, string fullWord = null)
    {
        Character = character;
        Word = fullWord;
    }
}

The node to represent the tree structure:

using System.Collections.Generic;

public class ObsceneNode
{
    public ObsceneNode(ObsceneValue value)
    {
        Value = value;
        Children = new Dictionary<char, ObsceneNode>();
    }

    public ObsceneValue Value { get; set; }
    public Dictionary<char, ObsceneNode> Children { get; set; }

    public bool HasChild(char character)
    {
        return Children.ContainsKey(character);
    }

    public ObsceneNode SafeAddChild(ObsceneValue value)
    {
        if (HasChild(value.Character))
        {
            var existing = GetChild(value.Character);
            //Promote the existing node to obscene when a shorter word
            //ends here (e.g. ABC added after ABCD); without this, the
            //shorter word would never be flagged.
            if (value.IsObscene && !existing.Value.IsObscene)
            {
                existing.Value = value;
            }
            return existing;
        }

        var node = new ObsceneNode(value);
        Children.Add(value.Character, node);
        return node;
    }

    public ObsceneNode GetChild(char character)
    {
        if (HasChild(character))
        {
            return Children[character];
        }

        return null;
    }
}

The actual obscenity filter. It includes debug logic to omit the duplicates found in nearly every obscene-word list on the internet; those duplicates would otherwise require redundant checks, or in this case, deeper trees than needed:

using System;
using System.Collections.Generic;
using System.Linq;

public static class SmutBlocker
{
    public static string[] WORDS = new[]
    {
        "ANY","LIST","OF","WORDS"
    };

    private static ObsceneNode _root;

    static SmutBlocker()
    {
        //abbreviatedList can be omitted once an optimized list of
        //terms is set to WORDS.
        var abbreviatedList = new List<string>();
        _root = new ObsceneNode(new ObsceneValue(default(char), null));
        var iter = _root;
        foreach (var word in WORDS)
        {
            for(var i = 0; i < word.Length; i++)
            {
                if (iter.Value.IsObscene)
                    break;

                var isObscene = (word.Length - 1) == i;
                iter = iter.SafeAddChild(new ObsceneValue(word[i], isObscene ? word : null));
                //The below list is to capture the optimized list
                if (isObscene)
                    abbreviatedList.Add(word.ToUpper());
            }
            iter = _root;
        }

        //Purely for fixing a non-optimized list
        //Remove once list is optimized
        var output = String.Empty;
        for (var i = 0; i < abbreviatedList.Count; i += 10)
        {
            var segment = abbreviatedList.Skip(i).Take(10);
            output += "\"" + String.Join("\",\"", segment) + "\"," + Environment.NewLine;
        }
    }

    //Finally the actual IsObscene check that loops through the tree.
    public static bool IsObscene(string text)
    {
        var iters = new List<ObsceneNode>(text.Length);
        for (var i = 0; i < text.Length; i++)
        {
            iters.Add(_root);
            var c = text[i];

            for (int j = iters.Count - 1; j >= 0; j--)
            {
                if (iters[j].HasChild(c))
                {
                    iters[j] = iters[j].GetChild(c);
                    if (iters[j].Value.IsObscene)
                    {
                        return true;
                    }
                }
                else
                {
                    iters.RemoveAt(j);
                }
            }
        }

        return false;
    }
}
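
For completeness, a hypothetical call site (ShortCode.Generate() is the on-demand generator mentioned in the comments above, not defined here) simply retries until the filter passes:

//Hypothetical usage: regenerate until a code passes the filter.
var code = ShortCode.Generate();
while (SmutBlocker.IsObscene(code))
{
    code = ShortCode.Generate();
}
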
VulgarBinary