I am new to regex and I have been going round and round on this problem.

The question "PHP: Check alphabetic characters from any latin-based language?" gives the brilliant regex to check for any characters in the Latin script, which is part of what I need.

^\p{Latin}+$

and provides a working example at https://regex101.com/r/I5b2mC/1

If I use the regex in PHP by using

echo preg_match('/^\p{Latin}+$/', $testString);

and $testString contains only Latin letters, the output will be 1. If there are any non-Latin letters, the output will be 0. Brilliant.

To add numbers I tried ^\p{Latin}+[[:alnum:]]*$ but that only allows characters in the Latin script followed by unaccented letters and numbers (letters without grave, acute, cedilla, umlaut etc.), as [[:alnum:]] is equivalent to [a-zA-Z0-9].

If $testString mixes numbers with characters in the Latin script, echo preg_match('/^\p{Latin}+[[:alnum:]]*$/', $testString); returns 0. A string of only numbers returns 0 too. This can be confirmed by editing the expression in https://regex101.com/r/I5b2mC/1

How do I edit the expression in echo preg_match('/^\p{Latin}+$/', $testString); to output a 1 if there are any characters in the Latin script, any numbers and/or spaces in $testString? For example, I wish for a 1 to be output if $testString is Café ßüs 459.

Chris Rogers

2 Answers

There are at least two things to change:

  • Add u flag to support chars other than ASCII (/^\p{Latin}+$/ => /^\p{Latin}+$/u)
  • Create a character class for letters, digits and whitespace patterns (/^\p{Latin}+$/u => /^[\p{Latin}]+$/u)
  • Then add the digit and whitespace patterns. If you need to support any Unicode digits, add \d. If you need to support only ASCII digits, add 0-9.

Thus, you can use

preg_match('/^[\p{Latin}\s0-9]+$/u', $testString); // ASCII only digits
preg_match('/^[\p{Latin}\s\d]+$/u', $testString);  // Any digits

Also, \s with u flag will match any Unicode whitespace chars.
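
To see how the two character-class patterns behave, here is a minimal sketch. It assumes the script file and the input are UTF-8 encoded; the helper name isLatinText and every sample string except Café ßüs 459 (taken from the question) are made up for illustration.

```php
<?php
// Hypothetical helper: true if $s consists only of Latin-script letters,
// whitespace and digits. $asciiDigitsOnly switches between 0-9 and \d.
function isLatinText(string $s, bool $asciiDigitsOnly = true): bool
{
    $pattern = $asciiDigitsOnly
        ? '/^[\p{Latin}\s0-9]+$/u'   // ASCII digits only
        : '/^[\p{Latin}\s\d]+$/u';   // any Unicode decimal digit
    // preg_match() returns 1 on a match, 0 on no match and false on error
    // (for example, when $s is not valid UTF-8), so compare strictly.
    return preg_match($pattern, $s) === 1;
}

$samples = ['Café ßüs 459', 'Café', '459', 'Привет 459', 'Café!'];
foreach ($samples as $s) {
    echo $s, ' => ', isLatinText($s) ? 1 : 0, PHP_EOL;
}
```

Note that a string made only of digits or whitespace also passes, because the character class does not require a letter to be present.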

Wiktor Stribiżew

More generally, it is possible to reject any string containing letters that are not Latin (without having to list, one by one, the characters or groups of characters you want to allow):

$re = '~ ^ (?! .* [^\PL\p{Latin}] ) .+ $ ~mux';
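
As a rough illustration of what this lookahead accepts and rejects, the sketch below runs the pattern against a few strings. The helper name hasNoForeignLetters and the sample strings (apart from Café ßüs 459 from the question) are invented for the example, and UTF-8 input is assumed.

```php
<?php
// Pattern from above: reject any line containing a letter that is not Latin.
// [^\PL\p{Latin}] reads "neither a non-letter nor Latin", i.e. a non-Latin letter.
$re = '~ ^ (?! .* [^\PL\p{Latin}] ) .+ $ ~mux';

// Hypothetical helper name, for illustration only.
function hasNoForeignLetters(string $s, string $re): bool
{
    return preg_match($re, $s) === 1;
}

foreach (['Café ßüs 459', '459 !?', 'Café Привет'] as $s) {
    echo $s, ' => ', hasNoForeignLetters($s, $re) ? 1 : 0, PHP_EOL;
}
```

Since only non-Latin letters are forbidden, digits, spaces and punctuation all pass, and so does a string with no letters at all.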

If you want strings with at least one Latin letter (and no letters from other alphabets), you can use a script run to build your pattern:

$re = '~ ^ [^\pL\r\n]* (?= \p{Latin} ) (*sr: .+ ) $ ~mux';
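
A small sketch of how this pattern could be tried out. It assumes UTF-8 input and a PHP build whose PCRE2 library supports script runs; on a build without that support the pattern fails to compile, which the made-up helper matchesLatinRun reports as null instead of guessing. The sample strings other than Café ßüs 459 are illustrative.

```php
<?php
// Pattern from above. (*sr: ... ) is a PCRE2 script-run group: everything it
// matches must belong to one script (Common characters such as digits and
// spaces are allowed alongside it).
$re = '~ ^ [^\pL\r\n]* (?= \p{Latin} ) (*sr: .+ ) $ ~mux';

// Hypothetical helper: returns null when preg_match() reports an error,
// for example when the PCRE2 library does not know script runs.
function matchesLatinRun(string $s, string $re): ?bool
{
    $result = @preg_match($re, $s);
    return $result === false ? null : $result === 1;
}

foreach (['Café ßüs 459', '459', 'Café Привет'] as $s) {
    echo $s, ' => ';
    var_dump(matchesLatinRun($s, $re));
}
```

The intent is that the first string is accepted, while the digits-only string (no Latin letter) and the mixed Latin and Cyrillic string are rejected.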

These two solutions may be more flexible. Obviously it all depends on the goal.

More about script runs can be found in the "Script runs" section of the PCRE2 pattern documentation.

Casimir et Hippolyte