I am keen on combining `[^\u0000-\u007F]+` and `^[A-Za-z0-9._-](?:[A-Za-z0-9._ -]*[A-Za-z0-9._-])?$` into a single regex, but it is so complicated that I couldn't make it work. Any ideas on how to integrate both?

I want to use a JavaScript version for client-side verification and a PHP version for server-side verification.

Rough
  • Mind telling us *what you intend to match* instead of just the RegEx? – Joseph Sep 02 '15 at 21:10
  • `[^\u0000-\u007F]+` is for non-English alphabetical characters such as `àèéìòóùà`, and `^[A-Za-z0-9._-](?:[A-Za-z0-9._ -]*[A-Za-z0-9._-])?$` allows only letters, numbers, dot, dash, and underscore, with no whitespace at the beginning or end. – Rough Sep 02 '15 at 21:14
  • Your question is very unclear. Provide strings that should match and those that should not match. What you tried to achieve that and what went wrong. – Wiktor Stribiżew Sep 02 '15 at 21:24
  • The two patterns are mutually exclusive, so you can't. (take a look at the ascii table.) – Casimir et Hippolyte Sep 02 '15 at 21:30
  • @Rough: not only! `[^\u0000-\u007f]` is for all that is not in the ascii table. – Casimir et Hippolyte Sep 02 '15 at 21:33
  • Just an idea: what if instead of `[^\u0000-\u007F]` we use `[\x80-\xFF]`? Try [`^[\x80-\xFF\w.-](?:[\x80-\xFF\w. -]*[\x80-\xFF\w.-])?$`](https://regex101.com/r/zI1gP8/1). – Wiktor Stribiżew Sep 02 '15 at 21:35
  • `^[A-Za-z0-9._-](?:[A-Za-z0-9._ -]*[A-Za-z0-9._-])?$` helps me a lot. Users can't enter anything other than letters, numbers, dot, dash, and underscore, and they can't use whitespace at the beginning or end. But there are other alphabets with their own characters. Italian has these: `àèéìòóùà`. So I want these letters to be valid as well. I came across [this answer](http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters/150078#150078) and found `[^\u0000-\u007F]+`, which helps. Now I'm trying to find out how to use both at once. – Rough Sep 02 '15 at 21:35
  • In this case add them to the class instead of trying to find a range, probably this: `^[a-zàèéìòóùà0-9._-]+(?: [a-zàèéìòóùà0-9._-]+)*$` with a case-insensitive flag (but your question stays very unclear). – Casimir et Hippolyte Sep 02 '15 at 21:45
  • @stribizhev this time I can't use those: `Ğ ğ Ş ş İ ı ą ć ę ł ń ó ś ź ż Ż Ź Ś Ó Ń Ł Ę Ć Ą`. @Casimir et Hippolyte I can't because there are lots of letters. – Rough Sep 02 '15 at 21:49
  • @Rough: yes, the reason is that PHP regex does not support `\uXXXX` notation. However, there is [a workaround](http://ideone.com/OsOQp4). This [`^[\u0080-\uFFFF\w.-](?:[\u0080-\uFFFF\w. -]*[\u0080-\uFFFF\w.-])?$`](https://regex101.com/r/zI1gP8/2) would match those letters. In JS, this regex can be used as is in literal notation. – Wiktor Stribiżew Sep 02 '15 at 22:07
  • @stribizhev Thank you. Couldn't do it myself. Literally spent hours. Thanks again. – Rough Sep 02 '15 at 22:11

1 Answer

I suggest using the remaining part of the Unicode table with `[\u0080-\uFFFF]` instead of `[^\u0000-\u007F]`.

In JS, `\w` matches `[A-Za-z0-9_]`, so I suggest using:

^[\u0080-\uFFFF\w.-](?:[\u0080-\uFFFF\w. -]*[\u0080-\uFFFF\w.-])?$

See demo
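
As a rough illustration of how the combined pattern behaves in JavaScript, here is a minimal sketch. The helper name `isValidName` and the sample inputs are assumptions made for this example only; they are not part of the original answer.

```javascript
// Combined pattern from the answer: ASCII word characters, dot and dash,
// plus any UTF-16 code unit from U+0080 to U+FFFF. Spaces are allowed
// only between the first and the last character.
const namePattern = /^[\u0080-\uFFFF\w.-](?:[\u0080-\uFFFF\w. -]*[\u0080-\uFFFF\w.-])?$/;

// Hypothetical helper, named here only for illustration.
function isValidName(input) {
  return namePattern.test(input);
}

console.log(isValidName("àèéìòóù"));      // true
console.log(isValidName("Łukasz Ğ-ş_1")); // true
console.log(isValidName(" leading"));     // false
console.log(isValidName("trailing "));    // false
console.log(isValidName("a@b"));          // false
```

Note that the range accepts every non-ASCII code unit, not only letters, so symbols such as `€` also pass this check.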

In PHP, just use `\p{L}` with the `/u` modifier:

$re = '/^[\p{L}0-9_.-](?:[\p{L}0-9_. -]*[\p{L}0-9_.-])?$/u'; 
          ^^^^^           ^^^^^          ^^^^^           ^

It looks like no one likes `\uXXXX` in PHP. @nhahtdh confirms there may be issues with matching the same code points.
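
For comparison, JavaScript engines that implement ES2018 Unicode property escapes accept the same `\p{L}` class when the `u` flag is set. The sketch below assumes such an engine; the name `isLetterName` is hypothetical and not part of the original answer.

```javascript
// Assumption: an ES2018+ engine, where the `u` flag enables \p{...} escapes.
const letterPattern = /^[\p{L}0-9_.-](?:[\p{L}0-9_. -]*[\p{L}0-9_.-])?$/u;

// Hypothetical helper, named here only for illustration.
function isLetterName(input) {
  return letterPattern.test(input);
}

console.log(isLetterName("Łukasz"));  // true
console.log(isLetterName("àèéìòóù")); // true
console.log(isLetterName("a€b"));     // false, "€" is a symbol, not a letter
console.log(isLetterName(" x"));      // false, leading space
```

Unlike the `[\u0080-\uFFFF]` range, `\p{L}` rejects non-letter symbols, which is closer to the stated goal of allowing letters from other alphabets.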

Wiktor Stribiżew
  • Your recommendation for the first regex is terrible. If you want to match characters outside the ASCII range, you must always use the `u` flag to correctly interpret the input. The solution with `json_decode` only happens to work, since the range encoded in UTF-8 is `\xC2\x80-\xEF\xBF\xBF`, which contains the range `\x80-\xEF`, which covers up to 3-byte UTF-8 encoded sequences. I think your answer should be edited to contain only the second solution. – nhahtdh Sep 03 '15 at 03:04
  • In PHP, if you want to specify a character by code point, use the `\x{hh...h}` syntax. It also works in a character class, but the range is limited by the mode (in default mode, up to the size of the code unit, which in PHP is 0xFF; in `u` mode, up to 0x10FFFF). – nhahtdh Sep 03 '15 at 03:05
  • `\p{L}` is a correct solution. `json_decode` on `\u0080-\uFFFF` is an unsafe solution without `u` flag - just use `\x{hh...h}` notation. – nhahtdh Sep 03 '15 at 09:01
  • Please don't try to make the same regex run on both PHP and JS. There are fundamental differences in how the two languages interpret the regex and represent strings. Your idea only happens to work in this case; if the end points are different, the code may give you unwanted surprises. – nhahtdh Sep 03 '15 at 09:11
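
The comments above are about PHP, but JavaScript has a comparable distinction between matching code units and matching code points. The sketch below is only an illustration; the sample character U+1D49C and the variable names are assumptions chosen for this example.

```javascript
// U+1D49C lies outside the Basic Multilingual Plane, so JavaScript stores it
// as two UTF-16 code units (a surrogate pair).
const astral = "\u{1D49C}";

// Without the `u` flag the class is matched one code unit at a time.
const oneUnit = /^[\u0080-\uFFFF]$/.test(astral);
const twoUnits = /^[\u0080-\uFFFF]{2}$/.test(astral);

// With the `u` flag the class is matched one code point at a time.
const bmpOnly = /^[\u0080-\uFFFF]$/u.test(astral);
const fullRange = /^[\u{80}-\u{10FFFF}]$/u.test(astral);

console.log(astral.length); // 2
console.log(oneUnit, twoUnits, bmpOnly, fullRange); // false true false true
```

The same range therefore accepts or rejects a character depending on the flag, which is one reason a pattern cannot be assumed to carry over unchanged between engines or modes.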