The requirement is to match people's last names, with a dash separating each last name from the next.

This is the base RegEx I am using:

(?=\S*[-])([a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+)

Basically, I am limiting it to Latin alphabet characters, including some accented characters.

This works perfectly fine if I use examples like:

  • Pérez-González
  • Domínguez-Díaz
  • Güemez-Martínez

But I forgot to account for the case where the person has only one last name.

I tried doing the following.

((?=\S*[-])([\ a-zA-ZÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù'-]+))|([A-Za-zÑñÁáÉéÍíÓóÚúÄäËëÏïÖöÜüÀàÈèÌìÒòÙù']+)

I added an escaped space (`\ `) to the allowed characters for the first match option, and an OR alternative for a single word without spaces.

While it works for some cases, there are two issues.

  1. I don't think it's the most optimal RegEx for a use case like this.
  2. I stumbled upon the specific case with people who have complex last names.

Regarding point 2, I refer to something like:

  • Johnson-De Sosa

The RegEx matches it, but it no longer respects the dash as a separator.

I am not sure how to handle this.

Also, since I added the space, it no longer enforces the requirement for a dash between last names.

What I am thinking is to limit the number of spaces within a last name, for example allowing at most 2 or 3 spaces, so that examples like:

  • Pérez-De la Cruz - this works with my RegEx
  • Pérez De la Cruz-González - this doesn't

can be valid matches.

I am no pro at RegEx, so some help would be greatly appreciated.

UPDATE

I failed to mention that I need to be able to use this with JavaScript. PHP could be useful too, but I am doing some browser validation and the patterns need to be compatible.

Mihail Minkov
  • Is the surname/last name the only thing in the string? Identifying a surname/last name can be very difficult in plain text. Maybe `[[:alpha:]]+([- ']?)` is simpler, seems to match all examples but also is very loose, e.g. `asdfl` is not a surname. – user3783243 Apr 12 '21 at 00:57
  • Yes, it's specifically for last names. That's why I am not sure how to proceed with the more complex ones. I don't know if I should do a bunch of `OR` conditions inside the RegEx or just simplify the required input. – Mihail Minkov Apr 12 '21 at 01:30
  • @user3783243 the `[[:alpha:]]+([- ']?)` recommendation is an interesting one, but it doesn't work with the accented characters. – Mihail Minkov Apr 12 '21 at 01:32
  • Use the `u` flag and it should extend to accented. So https://regex101.com/r/LsOqVr/1/ could achieve your goal – user3783243 Apr 12 '21 at 01:48
  • @user3783243 that last one I think does the trick, but when I try it in JavaScript on regex it shows a pattern error. – Mihail Minkov Apr 12 '21 at 03:28

1 Answer

Logically, you should match one or more letters, then allow a single occurrence of your chosen delimiting characters before allowing another string of one or more letters.

PHP Code:

$names = [
    'Pérez-González',
    'Domínguez-Díaz',
    'Güemez-Martínez',
    'Johnson-De Sosa',
    'Pérez-De la Cruz',
    'smith',
    'Pérez De la Cruz-González',
    'de Gal-O\'Connell',
    'Johnson--Johnson'
];

foreach ($names as $name) {
    echo "$name is " . (!preg_match("~^\pL+(?:[- ']\pL+)*$~u", $name) ? 'in' : '') . "valid\n";
}

JavaScript Code:

const names = [
    'Pérez-González',
    'Domínguez-Díaz',
    'Güemez-Martínez',
    'Johnson-De Sosa',
    'Pérez-De la Cruz',
    'smith',
    'Pérez De la Cruz-González',
    'de Gal-O\'Connell',
    'Johnson--Johnson'
];

// \p{L} matches any Unicode letter; it requires the `u` flag.
const pattern = /^\p{L}+(?:[- ']\p{L}+)*$/u;

for (const name of names) {
    document.write("<div>" + name + " is " + (pattern.test(name) ? '' : 'in') + "valid</div>");
}

This will only allow a single delimiter between sequences of letters. It will fail if someone's name is "Suzy 'Ng", because that has a space followed by an apostrophe (two consecutive delimiters). I don't know if such a name is possible or real; I just want to clarify.
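
To make the single-delimiter behaviour concrete, here is a minimal sketch that wraps the same pattern in a small helper and runs it against a few of the names above. The names `surnamePattern` and `isValidSurname` are made up for this illustration.

```javascript
// Same pattern as in the answer; the `u` flag is what enables \p{L}.
const surnamePattern = /^\p{L}+(?:[- ']\p{L}+)*$/u;

// Hypothetical helper name, used only for this illustration.
function isValidSurname(name) {
  return surnamePattern.test(name);
}

console.log(isValidSurname("Pérez De la Cruz-González")); // true
console.log(isValidSurname("de Gal-O'Connell"));          // true
console.log(isValidSurname("Johnson--Johnson"));          // false (two dashes in a row)
console.log(isValidSurname("Suzy 'Ng"));                  // false (space followed by apostrophe)
```

Using `test()` instead of `match()` returns a plain boolean, which is usually all a validation check needs.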

No lookarounds are necessary.
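
If the dash has to remain the separator between last names, as the question describes, one possible variation (still without lookarounds) is to treat spaces and apostrophes as part of a single last name and allow dashes only between last names. The sketch below is not the pattern from this answer. The limit of three spaces per last name is an assumption taken from the "at most 2 or 3 spaces" idea in the question, and `word`, `surname`, `dashSeparated` and `splitSurnames` are hypothetical names.

```javascript
// One word: letters, optionally with apostrophes inside (e.g. O'Connell).
const word = "\\p{L}+(?:'\\p{L}+)*";
// One last name: a word plus at most three more words, each after a single space.
// The limit of three is an assumption based on the question.
const surname = `${word}(?: ${word}){0,3}`;
// Whole string: one last name, or several joined by single dashes.
const dashSeparated = new RegExp(`^${surname}(?:-${surname})*$`, "u");

// Hypothetical helper: returns the list of last names, or null if the input is invalid.
function splitSurnames(input) {
  return dashSeparated.test(input) ? input.split("-") : null;
}

console.log(splitSurnames("Johnson-De Sosa"));           // ["Johnson", "De Sosa"]
console.log(splitSurnames("Pérez De la Cruz-González")); // ["Pérez De la Cruz", "González"]
console.log(splitSurnames("smith"));                     // ["smith"]
console.log(splitSurnames("Johnson--Johnson"));          // null
```

Because a dash can only appear between two complete last names, splitting a valid input on "-" recovers each last name. The trade-off is that a last name that itself contains a dash cannot be represented.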

mickmackusa
  • If someone finds a good duplicate, let me know and I'll delete this answer. This page demonstrates very poor validation https://stackoverflow.com/q/19055787/2943403. There is a mix of good and bad advice at https://stackoverflow.com/q/275160/2943403. This is relevant https://stackoverflow.com/q/58964670/2943403 – mickmackusa Apr 12 '21 at 02:19
  • This is worth linking to https://stackoverflow.com/q/888838/2943403. Here is a page with borderline sarcastic insights https://stackoverflow.com/q/5105244/2943403. – mickmackusa Apr 12 '21 at 02:26