2

I'm using this PHP regexp to check whether a field contains a name consisting of at least a first and last name, plus optional middle names or initials.

$success = preg_match("/([\x{00c0}-\x{01ff}a-zA-Z'-]){2,}(\s([\x{00c0}-\x{01ff}a-zA-Z'-]{1,})*)?\s([\x{00c0}-\x{01ff}a-zA-Z'-]{2,})/ui",$user['name'],$matches);

$output[($success ? 'hits' : 'misses')][] = ['id' => $user['id'],'email' => $user['email'],'name' => $user['name'],'matches' => $matches];

Seems to work fine in terms of hits/misses, i.e. true/false whether it matches or not.

But then I'm trying to use the same pattern to extract the first and last names using groups, which I'm struggling to get right.

Get lots of results like:

  "name": "Jonny Nott",
  "matches": [
    "Jonny Nott",
    "y",
    "",
    "",
    "Nott"
  ]

  "name": "Name Here",
  "matches": [
    "Name Here",
    "e",
    "",
    "",
    "Here"
  ]

  "matches": [
    "Jonathan M Notty",
    "n",
    " M",
    "M",
    "Notty"
  ]

But what I really want is for one of the 'matches' to always contain just the first name, and another to always contain just the last name.

Any pointers as to what's wrong?

Jonny Nott

3 Answers

3

Whenever you define a capturing group in a regular expression, the part of string it matches is added as a separate item in the resulting array. There are two strategies to get rid of them:

  • Optimize the pattern and get rid of the redundant groups (e.g. groups around single atoms - (a)+ => a+)
  • Turn capturing groups into non-capturing ((\s+\w+)+ => (?:\s+\w+)+)
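
As a minimal sketch of the difference between the two kinds of group (the sample string and variable names are only illustrative), compare what `preg_match` puts into the result array:

```php
<?php
// Capturing group: the repeated group adds an extra array item,
// which holds only the text from its final iteration.
preg_match('/\w+(\s+\w+)+/', 'one two three', $capturing);
// $capturing is ['one two three', ' three']

// Non-capturing group: same overall match, no extra item.
preg_match('/\w+(?:\s+\w+)+/', 'one two three', $nonCapturing);
// $nonCapturing is ['one two three']
```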

Also, in your case, you can improve the pattern by replacing the letter-matching part with the \p{L} Unicode property class, which matches any letter.

Use

/[\p{L}'-]{2,}(?:\s[\p{L}'-]+)?\s[\p{L}'-]{2,}/u


Here, only one group is left, (?:...), and it is optional: the ? after it makes it match 1 or 0 times.

Details

  • [\p{L}'-]{2,} - 2 or more letters, ' or -
  • (?:\s[\p{L}'-]+)? - 1 or 0 occurrences of a whitespace and then 1 or more letters, ' or -
  • \s - a whitespace
  • [\p{L}'-]{2,} - 2 or more letters, ' or -
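
To actually pull out the first and last name, one option is to wrap the first and last parts of this pattern in capturing groups. The sketch below assumes that small change (the extra parentheses are not in the pattern above) and uses sample names from the question:

```php
<?php
// Same pattern as above, with capturing groups added around the
// first and last name parts (an assumption for illustration).
$pattern = "/([\p{L}'-]{2,})(?:\s[\p{L}'-]+)?\s([\p{L}'-]{2,})/u";

preg_match($pattern, 'Jonny Nott', $short);
// $short is ['Jonny Nott', 'Jonny', 'Nott']

preg_match($pattern, 'Jonathan M Notty', $withMiddle);
// $withMiddle is ['Jonathan M Notty', 'Jonathan', 'Notty']
```

The optional middle part stays non-capturing, so index 1 is always the first name and index 2 is always the last name.
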
Wiktor Stribiżew
1

Try:

(?P<firstName>[\x{00c0}-\x{01ff}a-zA-Z'-]{2,})(\s([\x{00c0}-\x{01ff}a-zA-Z'-]{1,})*)?\s(?P<lastName>[\x{00c0}-\x{01ff}a-zA-Z'-]{2,})

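Your main mistake was applying the {2,} quantifier to the first group instead of to the character class inside it. A repeated capturing group keeps only its last iteration, which is why you got a single letter such as "y".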
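
A short sketch of both points, using a simplified `[a-z]` class for the quantifier comparison and the pattern above (with the `u` flag added) for the named groups; the variable names are only illustrative:

```php
<?php
// Quantifier outside the group: only the last repeated character is kept.
preg_match('/([a-z]){2,}/', 'jonny', $repeatedGroup);
// $repeatedGroup is ['jonny', 'y']

// Quantifier inside the group: the whole run is captured.
preg_match('/([a-z]{2,})/', 'jonny', $repeatedClass);
// $repeatedClass is ['jonny', 'jonny']

// Named groups can be read by name instead of by position.
$pattern = "/(?P<firstName>[\x{00c0}-\x{01ff}a-zA-Z'-]{2,})(\s([\x{00c0}-\x{01ff}a-zA-Z'-]{1,})*)?\s(?P<lastName>[\x{00c0}-\x{01ff}a-zA-Z'-]{2,})/u";
preg_match($pattern, 'Jonathan M Notty', $m);
$first = $m['firstName']; // 'Jonathan'
$last  = $m['lastName'];  // 'Notty'
```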

Khazul
1

Use non-capturing groups (?:...) wherever you need parentheses but don't want to capture that part (e.g. the whitespace and middle-name part), and put the quantifier inside the capturing group rather than only the character class (e.g. for the first name, {2,} should be inside the capturing group).

([\x{00c0}-\x{01ff}a-zA-Z'-]{2,})(?:\s(?:[\x{00c0}-\x{01ff}a-zA-Z'-]{1,})*)?\s([\x{00c0}-\x{01ff}a-zA-Z'-]{2,})
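
For illustration, the pattern could be wrapped in a small function. `splitName` is a hypothetical helper, not part of the original code, and the `u` flag is assumed as in the question:

```php
<?php
// Hypothetical helper: returns the first and last name,
// or null when the input does not match the pattern.
function splitName(string $name): ?array
{
    $pattern = "/([\x{00c0}-\x{01ff}a-zA-Z'-]{2,})(?:\s(?:[\x{00c0}-\x{01ff}a-zA-Z'-]{1,})*)?\s([\x{00c0}-\x{01ff}a-zA-Z'-]{2,})/u";
    if (preg_match($pattern, $name, $m) !== 1) {
        return null;
    }
    // With the middle part non-capturing, only two groups remain.
    return ['first' => $m[1], 'last' => $m[2]];
}

$a = splitName('Jonny Nott');       // ['first' => 'Jonny', 'last' => 'Nott']
$b = splitName('Jonathan M Notty'); // ['first' => 'Jonathan', 'last' => 'Notty']
$c = splitName('Jonny');            // null, no last name present
```
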
Egan Wolf