I'm trying to extract people's names from text files, which I read line by line. Given how the files are structured, the first and last name should almost always be on the same line, within the first few lines of the file. Currently, I look up each word in an array of ~2300 first names and assume that the word following a match is the last name. The problem is that this approach doesn't match names reliably, so it may identify a different word in the file as the name. For example, my name is Daniel, but the function skips over my name and recognizes Virginia (a word later in the file) as my first name. Am I doing anything wrong, and is there a better way of doing this? I am pretty new to PHP, so chances are I'm making a silly mistake.

Clarifications: The file is a raw text file containing data extracted from pictures of resumes via OCR. For the purposes of my project, I am assuming that there is always a first and last name (no middle name), and that both will be on the same line.

$name = $this->search($line);
if (count($name) > 0 && empty($fname) && empty($lname)){
    $fname = $name[0];
    $lname = $name[1];
}

function search($str){ //$str is the current file line being read
    require "utils".DIRECTORY_SEPARATOR."dictionary-first-names.php";
    $arr = explode(" ", $str);

    for ($i = 0; $i < count($arr); $i++){
        if (in_array(mb_strtolower($arr[$i]), $dict)){
            return array($arr[$i], $arr[$i+1]); //shouldn't have array out of bounds as first & last name should be on the same line
        }
    }
}

Here is a pastebin link to dictionary-first-names.php, since it's very long: https://pastebin.com/cRFkR4fh

Daniel
  • I think it's easier to look for capitalized first letters. – toor Mar 14 '18 at 03:21
  • @toor Well any word in the file can be capitalized, so that wouldn't work well – Daniel Mar 14 '18 at 03:21
  • This is definitely worth reading ~ https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ – Phil Mar 14 '18 at 03:25
  • @Daniel Of course, but two capitalized words together help you find the first name and surname. – toor Mar 14 '18 at 03:28
  • @toor Maybe, but that would be foiled if the words belong to, say, the name of a store (e.g. Jersey Mike's). – Daniel Mar 14 '18 at 03:30
  • @Daniel Yes, sure, it's a problem. Maybe LISP will be better for that – toor Mar 14 '18 at 03:32
  • @Phil Interesting read. For the purposes of the project, though, I'm keeping things simple by just assuming that the names in all the files I'm analyzing always have 2 parts (First, Last). – Daniel Mar 14 '18 at 03:37
  • @toor Never used it before. A PHP solution would be the best for my situation. – Daniel Mar 14 '18 at 03:38
  • See also https://stackoverflow.com/questions/888838/regular-expression-for-validating-names-and-surnames – Raedwald Oct 01 '19 at 10:16

1 Answer

You can use Named Entity Recognition (NER) methods; spaCy and Stanford CoreNLP are two of the best-known libraries for that purpose. However, you would need to do that in Python rather than PHP.
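
If the lookup has to stay in PHP, as the question asks, the dictionary approach can at least be made more tolerant of OCR output. The following is a minimal sketch, not a definitive implementation. The function name findName, the two-entry $firstNames array, and the sample lines are hypothetical stand-ins; the real dictionary would come from dictionary-first-names.php. It assumes that names are compared case-insensitively, that words may carry stray punctuation, and that a line may end with a line break.

```php
<?php
// Hypothetical helper (not from the question): returns [first, last] or null.
// $firstNames is assumed to be keyed by lowercase first name.
function findName(string $line, array $firstNames): ?array
{
    // Split on runs of whitespace; this drops empty tokens and the line break.
    // preg_split returns false on invalid UTF-8, so fall back to an empty list.
    $words = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Stop one word short of the end so $words[$i + 1] always exists.
    for ($i = 0; $i < count($words) - 1; $i++) {
        // Strip leading and trailing non-letters (commas, colons, digits).
        $first = preg_replace('/^[^\p{L}]+|[^\p{L}]+$/u', '', $words[$i]);
        $last  = preg_replace('/^[^\p{L}]+|[^\p{L}]+$/u', '', $words[$i + 1]);

        if ($first !== '' && $last !== '' && isset($firstNames[mb_strtolower($first)])) {
            return [$first, $last];
        }
    }

    return null; // no known first name followed by another word on this line
}

// Tiny stand-in for the ~2300-entry dictionary; array_flip turns values into keys.
$firstNames = array_flip(['daniel', 'virginia']);

// Made-up sample lines: the first one has stray punctuation and a line break.
$lines = ["  Daniel,  Smith \r\n", "Richmond, Virginia 23220"];

$fname = $lname = null;
foreach ($lines as $line) {
    $name = findName($line, $firstNames);
    if ($name !== null) {
        [$fname, $lname] = $name;
        break; // keep only the first match, as in the question
    }
}
```

Splitting on whitespace runs and trimming non-letter characters keeps a token such as "Daniel," or a name followed by a line break from slipping past the lookup, which may be why the original code skipped a name. Returning null when nothing matches lets the caller test the result explicitly instead of calling count() on a missing value, and keying the array by name replaces the linear in_array scan with an isset check. A dictionary lookup still cannot tell a first name from a place name such as Virginia when another word follows it, which is where an NER model may do better.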