5

I am provided a list of names in upper case. For the purpose of a salutation in an email I would like them to be Proper Cased.

Easy enough to do using PHP's ucwords. But I feel I need some regex function to handle common exceptions, such as:

"O'Hara", "McDonald", "van der Sloot", etc

It's not so much that I need help constructing a regex statement to handle the three examples above (tho that would be nice), as it is that I don't know what all the common exceptions might be.

Surely someone has faced this issue before. Any pointers to published solutions, or something you could share?

AllInOne
  • So, you don't need help with the code... just a list of names? – vcsjones Jul 17 '12 at 19:13
  • This is quite difficult, as things like MacDonald and Macdonald are both proper spellings of a last name, and how it is cased depends on the person. – John Sobolewski Jul 17 '12 at 19:14
  • Why not have the user enter their own name? – Baylor Rae' Jul 17 '12 at 19:15
  • @vcsjones I guess my hope was that someone has already determined that there are say, 25, common special cases and written a regular expression to handle each of them. – AllInOne Jul 17 '12 at 19:16
  • Also what about MacDonald vs Mackay? – John Sobolewski Jul 17 '12 at 19:17
  • @jsobo agreed. For this purpose tho I am trying to get from unacceptable (O'HARA), past barely acceptable (O'hara), to commonly accepted (O'Hara). Doesn't have to be perfect tho. – AllInOne Jul 17 '12 at 19:19
  • @AllInOne similar thread here: http://forums.devshed.com/php-development-5/capitalization-issues-435695.html – Jeff Lambert Jul 17 '12 at 19:20
  • You might want to look into Named Entity Recognition: http://en.wikipedia.org/wiki/Named-entity_recognition – polm23 Nov 05 '12 at 01:10
  • There are tons and tons of exceptions, some of which you won't be able to deal with, because it's entirely down to the choice of the person whose name it is. For example, there's a "Georges de La Tour" and a "Frances de la Tour". Whether the "l" of "la" is capitalised is entirely arbitrary; there is no rule. – Matt Gibson Mar 22 '16 at 11:37

3 Answers

2

Regular expressions can work for a short, known list of names, but if you must handle hundreds or thousands of records it is very hard to make them bulletproof.

I'd rather use something that only changes the names you have explicitly reviewed. How do you know whether Mr. "MACDONALD" prefers "Macdonald" or "MacDonald"?

You're correcting someone else's error. If the source cannot be corrected, you could use something like this:

<?php

$provided_names = array(
  "SMITH",
  "O'HARA",
  "MCDONALD",
  "JONES",
  "VAN DER SLOOT",
  "MACDONALD"
);

$corrected_names = array(
  "O'HARA"        => "O'Hara",
  "MCDONALD"      => "McDonald",
  "VAN DER SLOOT" => "van der Sloot"
);

$email_text = array();

foreach ($provided_names as $provided_name)
{
  // Use the hand-reviewed spelling when one exists; otherwise fall back to plain title case
  $display_name = array_key_exists($provided_name, $corrected_names)
    ? $corrected_names[$provided_name]
    : ucwords(strtolower($provided_name));
  $email_text[] = "{$display_name}, your message text.";
}

print_r($email_text);

/* output:
Array
(
  [0] => Smith, your message text.
  [1] => O'Hara, your message text.
  [2] => McDonald, your message text.
  [3] => Jones, your message text.
  [4] => van der Sloot, your message text.
  [5] => Macdonald, your message text.
)
*/
?>

I hope it is useful.

quantme
  • I've been thinking about this some more and think yours is part of an interesting approach. What if the $corrected_names array were generated as follows: pull every name we can find (say from a phone directory or the census), and where there is more than one capitalization pattern for a name, retain only the most popular. That way every name would be "corrected" with capitalization in the most common pattern for that name. Perfect? Certainly not; but I am trying not to let perfect be the enemy of good. – AllInOne Jul 18 '12 at 00:16
  • I was thinking about it too. What I would do is ask the customer, client, or marketing department (by email form or phone call, if possible) to review the personal information, perhaps offering a reward such as a discount or gift as motivation. – quantme Aug 28 '12 at 14:21
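
The "keep only the most popular capitalization" idea from the first comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name build_corrected_names and the sample data are hypothetical, and it assumes you already have a list of names spelled the way people actually write them.

```php
<?php

// Hypothetical helper: builds an UPPERCASE => preferred-spelling map by
// keeping the most frequently observed capitalization of each name.
function build_corrected_names(array $observed_names)
{
  $counts = array();
  foreach ($observed_names as $name) {
    // strtoupper() keeps the sketch dependency-free; accented names
    // would need mb_strtoupper() instead (assumption: ASCII input here)
    $key = strtoupper($name);
    if (!isset($counts[$key][$name])) {
      $counts[$key][$name] = 0;
    }
    $counts[$key][$name]++;
  }

  $corrected = array();
  foreach ($counts as $key => $spellings) {
    arsort($spellings);              // most frequent spelling first
    reset($spellings);
    $corrected[$key] = key($spellings);
  }
  return $corrected;
}

$observed = array('MacDonald', 'Macdonald', 'MacDonald', "O'Hara");
$corrected_names = build_corrected_names($observed);
// MACDONALD maps to MacDonald, O'HARA maps to O'Hara
```

The resulting array has the same shape as the hand-written $corrected_names in the answer, so it could be dropped into the same lookup. Ties between equally common spellings are not handled specially here.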
2

I wrote this today for an app I'm working on. I think the code is pretty self-explanatory with its comments. It's not 100% accurate in all cases, but it will handle most Western names easily.

Examples:

mary-jane => Mary-Jane

o'brien => O'Brien

Joël VON WINTEREGG => Joël von Winteregg

jose de la acosta => Jose de la Acosta

The code is extensible in that you may add any string value to the arrays at the top to suit your needs. Please study it and add any special feature that may be required.

function name_title_case($str)
{
  // Name parts that should be lowercase in most cases
  $ok_to_be_lower = array('av','af','da','dal','de','del','der','di','la','le','van','den','vel','von');
  // Name parts that should stay lowercase even at the beginning of a name
  $always_lower   = array('van', 'der');

  // Split the lowercased string into its space-separated parts
  $parts = explode(' ', mb_strtolower(trim($str)));
  $c13n  = array();

  // Loop through and cap-or-dont-cap; working by position keeps repeated parts intact
  foreach ($parts as $index => $part)
  {
    if (!in_array($part, $ok_to_be_lower))
    {
      // Capitalize the first letter and any letter after a hyphen or apostrophe:
      // o'brien -> O'Brien || mary-kaye -> Mary-Kaye
      $c13n[] = ucwords($part, "-'");
    }
    else if ($index === 0 && !in_array($part, $always_lower))
    {
      // If the first part of the string is ok_to_be_lower, cap it anyway
      $c13n[] = ucfirst($part);
    }
    else
    {
      $c13n[] = $part;
    }
  }

  return implode(' ', $c13n);
}
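
As an illustration of the kind of special feature that could be added, here is a minimal sketch of a "Mc" rule applied after title-casing. The helper name fix_mc_prefix is hypothetical and not part of the function above. It assumes the input has already been title-cased (e.g. "Mcdonald"), and it deliberately leaves "Mac" names alone, since "Macdonald" and "MacDonald" are both valid.

```php
<?php

// Hypothetical helper, not part of name_title_case(): restores the inner
// capital after a leading "Mc" in an already title-cased name.
function fix_mc_prefix($name)
{
  // \b anchors "Mc" to the start of a word; ([a-z]) is the letter to capitalize
  return preg_replace_callback(
    '/\bMc([a-z])/',
    function ($m) { return 'Mc' . strtoupper($m[1]); },
    $name
  );
}

echo fix_mc_prefix('Mcdonald') . PHP_EOL;        // McDonald
echo fix_mc_prefix('Ronald Mcdonald') . PHP_EOL; // Ronald McDonald
echo fix_mc_prefix('Smith') . PHP_EOL;           // Smith
```

It could be called on the return value of the title-casing function, so the two concerns stay separate.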
gillytech
2

I wrote a small lib for this: https://github.com/tamtamchik/namecase You can install it with Composer.

For your inputs it produces exactly what you need using the following code:

<?php

require_once 'vendor/autoload.php'; // Composer autoload

$arr = ["O'HARA", "MCDONALD", "VAN DER SLOOT"];

foreach ($arr as $name) {
    echo $name . ' => ' . str_name_case($name) . PHP_EOL;
}

Call the str_name_case function that ships with the library on any name string, and it will be converted to proper case. For your examples the output is:

O'HARA => O'Hara
MCDONALD => McDonald
VAN DER SLOOT => van der Sloot

Iurii Tkachenko
  • Yes - that is a much better answer now, thanks for taking the advice. What would your library do with a name like ["MACDONALD"](https://en.wikipedia.org/wiki/Macdonald), which has two forms of capitalization? – Mogsdad Mar 22 '16 at 13:47
  • @Mogsdad by default it'll convert to `MacDonald`, but I might add an option for this if there is a feature request. I'm trying to stick to the original Perl version https://metacpan.org/pod/distribution/Lingua-EN-NameCase/README and MacDonald is not an exception there. – Iurii Tkachenko Mar 22 '16 at 13:54