
The closest existing question I have found is this or this

I would like to write a function or class that accepts a string and, based on whatever criteria can be programmed into it, returns the probability that it is a real human name. At the moment I would expect it to be heavily biased toward English or European names, or English transliterations of other names. For example, "bob", "bob smith", and "smith" should all return 1.0, and "sfgoisxdzzg" should return something like .001 or even .0000001.

Does anyone know if this has already been done or is being done, even in another language? My first thought was that I'd have to write some sort of machine learning script. My problem with that is my complete ignorance of machine learning theory.

So, the second part of my question is this: Is machine learning a viable option for tackling this problem? If so, what resources should I start with to learn how to do it? If not, can you point me in the right direction?

TecBrat
  • The accepted answer for your first referenced question is what you should take as an answer to this. It's up to yourself/administration to monitor the database and issue a bad/punishment/penalty for invalid names – Daryl Gill Mar 28 '13 at 03:04
  • Out of interest why would you want to do this? – Jim Mar 28 '13 at 03:08
  • You can take a look at [Levenshtein](http://php.net/manual/en/function.levenshtein.php) and the other similar functions linked from there. Like the BCS bowl selection, just toss them through a bunch of different tests and see what you end up with. You would probably need some archetypes, however. – Jared Farrish Mar 28 '13 at 03:08
  • This question may also be of use: http://stackoverflow.com/a/6298193/505722 – Jim Mar 28 '13 at 03:10
  • Maybe Facebook has implemented an idea for name checking: if a part uses a special character or a dictionary word, then it's invalid. –  Mar 28 '13 at 05:53
  • @Jim My initial reason is because I have contact forms that get names like gyjSFjXJHjtgfgc. I already have some spam tests in place, but I thought that it would make an interesting side project where I might be able to teach myself something. – TecBrat Mar 28 '13 at 12:25
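The Levenshtein suggestion in the comments could be sketched roughly as follows. This is only an illustration: the `KNOWN_NAMES` list and the function names `closestNameSimilarity` and `nameSimilarity` are hypothetical, and a real reference list would need to be far larger. The result is a similarity in the range 0 to 1, not a true probability.

```php
<?php
// Hypothetical reference list; a real one would be much larger.
const KNOWN_NAMES = ['bob', 'smith', 'mary', 'john', 'garcia', 'nguyen'];

/**
 * Returns the similarity (0 to 1) between $token and the closest known name.
 * 1.0 means an exact match; values near 0 mean nothing in the list is close.
 */
function closestNameSimilarity(string $token, array $knownNames = KNOWN_NAMES): float
{
    $token = strtolower(trim($token));
    if ($token === '') {
        return 0.0;
    }
    $best = 0.0;
    foreach ($knownNames as $name) {
        // Normalize the edit distance by the longer of the two strings.
        $distance = levenshtein($token, $name);
        $longest = max(strlen($token), strlen($name));
        $best = max($best, 1.0 - $distance / $longest);
    }
    return $best;
}

/** Averages the per-token similarity over a whitespace-separated name. */
function nameSimilarity(string $input): float
{
    $tokens = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    if (!$tokens) {
        return 0.0;
    }
    return array_sum(array_map('closestNameSimilarity', $tokens)) / count($tokens);
}
```

With this toy list, `nameSimilarity('bob smith')` is an exact match on both tokens, while a string like `sfgoisxdzzg` is far from every entry and scores low. Note that before PHP 8, `levenshtein()` rejects strings longer than 255 characters.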

1 Answer


This is the Bayesian approach that I use for filtering, with quite a bit of success, on a contact submission form and a request-for-quote form. The forms use scoring and handle requests from all over the world in various languages. Only if a submission fails 3 or 4 tests on various fields do I mark it as a spam attempt. Obviously, things like '123456' instantly throw up a red flag for a phone number. BBCode in the comments is also a dead giveaway.

<?php
/**
 * Scores how suspicious a submitted name looks.
 * Higher means more likely spam; 0 means no test failed.
 */
function nameCheck($var) {
    $nameScore = 0;
    $chars_count = strlen($var);
    // Count consonants and vowels (Y is treated as a vowel here).
    $consonant_count = strlen(preg_replace('![^BCDFGHJKLMNPQRSTVWXZ]!i', '', $var));
    $vowel_count = strlen(preg_replace('![^AEIOUY]!i', '', $var));

    // We're expecting first and last name: fewer than 4 characters scores +3.
    if ($chars_count < 4) {
        $nameScore += 3;
    }

    // More than 4 characters and no spaces scores +4.
    if ($chars_count > 4 && strpos($var, ' ') === false) {
        $nameScore += 4;
    }

    // More than 4 characters but no consonants or no vowels scores +5.
    if ($chars_count > 4 && ($consonant_count == 0 || $vowel_count == 0)) {
        $nameScore += 5;
    }

    // More than 4 characters and a vowel-to-consonant ratio below 1/8 scores +5.
    if ($consonant_count > 0 && $vowel_count > 0 && $chars_count > 4
        && $vowel_count / $consonant_count < 1 / 8) {
        $nameScore += 5;
    }

    // Needs at least 1 letter; otherwise score +10.
    if (!preg_match('![A-Za-z]!', $var)) {
        $nameScore += 10;
    }

    return $nameScore;
}

// Added for testing: the name to score is read from the "email" query parameter.
$var = isset($_GET['email']) ? $_GET['email'] : '';
echo nameCheck($var);
?>

Even if a submission fails, I have the form copy me on the attempt so I can adjust my scoring. There are a few false positives, usually with Chinese or Korean names, but for the most part anyone who completes the form in English will pass. Names like "Wu Xi" do exist.
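To illustrate the "fail 3 or 4 tests across fields" rule described above, here is a minimal sketch of how per-field checks might be combined. It is not the code running on the actual forms: the function `looksLikeSpam`, the field names, the patterns, and the default threshold of 3 are all assumptions made for the example.

```php
<?php
/**
 * Returns true when at least $threshold field checks flag the submission.
 * $checks maps a field name to a callable that returns true for suspicious input.
 */
function looksLikeSpam(array $fields, array $checks, int $threshold = 3): bool
{
    $failed = 0;
    foreach ($checks as $fieldName => $isSuspicious) {
        // A missing field is checked as an empty string.
        $value = isset($fields[$fieldName]) ? (string) $fields[$fieldName] : '';
        if ($isSuspicious($value)) {
            $failed++;
        }
    }
    return $failed >= $threshold;
}

// Hypothetical checks; the field names and patterns are assumptions.
$checks = [
    // A name needs at least one letter.
    'name' => function (string $v): bool {
        return !preg_match('/[A-Za-z]/', $v);
    },
    // A phone number containing "123456" is a red flag.
    'phone' => function (string $v): bool {
        return strpos($v, '123456') !== false;
    },
    // BBCode tags such as [url=...] or [/b] in the comments are a giveaway.
    'comments' => function (string $v): bool {
        return (bool) preg_match('~\[/?(url|img|b|i|u)\b[^\]]*\]~i', $v);
    },
];

$spam = ['name' => '12345', 'phone' => '123456', 'comments' => '[url=http://example.com]buy[/url]'];
$ham  = ['name' => 'Bob Smith', 'phone' => '555-867-5309', 'comments' => 'Please send a quote.'];
```

Here `looksLikeSpam($spam, $checks)` flags the submission because all three checks fail, while `looksLikeSpam($ham, $checks)` does not. Requiring several failures before flagging is what keeps a single odd field from rejecting a legitimate submission.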

AbsoluteƵERØ
  • This is similar in concept to how I currently deal with potential spam. I'll give some time for other answers, but I'll probably accept this one. – TecBrat Mar 28 '13 at 12:31
  • Since reading your answer, I have been studying the word "Bayesian" and I am very intrigued. This might be a whole new area for me to direct some learning. THANKS! [Spam Filtering](http://en.wikipedia.org/wiki/Bayesian_spam_filtering) I was already doing this without knowing the word, but now I know what to search for to find more resources on it. – TecBrat Mar 28 '13 at 15:25
  • As a side note, I also found this [Gibberish Detector](https://github.com/buggedcom/Gibberish-Detector-PHP) that would do the trick for me. It used a novel as the training text and I suspect a person could use a name-list instead. – TecBrat Mar 29 '13 at 02:52
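The idea of training on a name list instead of a novel could look something like the sketch below. It is not the linked Gibberish Detector's code, only a simple character-bigram model with add-one smoothing written for illustration; the function names and the tiny training list are assumptions.

```php
<?php
// Possible "next" symbols: a-z, space, and the end marker '$'.
const NEXT_SYMBOLS = 28;

/** Lowercases the input and keeps only a-z and spaces. */
function normalizeName(string $name): string
{
    return preg_replace('/[^a-z ]/', '', strtolower($name));
}

/** Counts character-to-character transitions; '^' and '$' mark start and end. */
function trainBigramCounts(array $names): array
{
    $counts = [];
    foreach ($names as $name) {
        $text = '^' . normalizeName($name) . '$';
        for ($i = 0, $n = strlen($text) - 1; $i < $n; $i++) {
            $from = $text[$i];
            $to = $text[$i + 1];
            $counts[$from][$to] = ($counts[$from][$to] ?? 0) + 1;
        }
    }
    return $counts;
}

/** Mean log-probability per transition; values closer to 0 look more name-like. */
function averageLogProbability(string $input, array $counts): float
{
    $text = '^' . normalizeName($input) . '$';
    $transitions = strlen($text) - 1;
    $sum = 0.0;
    for ($i = 0; $i < $transitions; $i++) {
        $row = $counts[$text[$i]] ?? [];
        $seen = $row[$text[$i + 1]] ?? 0;
        // Add-one smoothing so unseen transitions are unlikely, not impossible.
        $sum += log(($seen + 1) / (array_sum($row) + NEXT_SYMBOLS));
    }
    return $sum / $transitions;
}

// Hypothetical training data; a real list would hold thousands of names.
$counts = trainBigramCounts(['bob smith', 'mary jones', 'john smith', 'anna garcia', 'robert brown']);
```

With these counts, `averageLogProbability('bob smith', $counts)` comes out higher than `averageLogProbability('sfgoisxdzzg', $counts)`. A cutoff between "name" and "gibberish" would have to be picked by scoring known good and bad samples, and the score is a relative measure, not a calibrated probability.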