18

I am developing an application for a European client, and they use their native character set.

Now I need a regex that allows foreign characters like eéèêë, etc., and I am not sure how this can be done.

Any suggestions?

isherwood
Rachel

4 Answers

21

If all you want to match is letters (including "international" letters) you can use \p{L}.

You can find some information on regex and Unicode here.
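As a minimal sketch, assuming JavaScript on an engine that supports Unicode property escapes (they require the `u` flag there), `\p{L}` can go inside a character class together with anything else that should be allowed. The function name `isLettersAndSpaces` and the decision to also allow spaces are assumptions for illustration, not requirements taken from the question.

```javascript
// \p{L} matches any Unicode letter; in JavaScript it only works with the `u` flag.
// Allowing the space character as well is an assumption made for this sketch.
const lettersAndSpaces = /^[\p{L} ]+$/u;

// Hypothetical helper: true when the text consists only of letters and spaces.
function isLettersAndSpaces(text) {
  return lettersAndSpaces.test(text);
}

console.log(isLettersAndSpaces("Zoë Brontë"));   // true
console.log(isLettersAndSpaces("crème brûlée")); // true
console.log(isLettersAndSpaces("abc123"));       // false (digits are not letters)
```

Anything else the input may contain (digits, hyphens, apostrophes) would have to be added to the class explicitly.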

Fredrik Mörk
  • Should it be done like `/^[a-zA-Z ]+$/\p{L}`? It is not working this way. – Rachel Jun 09 '10 at 21:31
  • @Rachel: You will probably need more than only `\p{L}` since this will match *only* letters (not spaces or other separators or numbers for instance). Exactly how it should look is impossible to say without knowing the full requirements that you need to fulfill. – Fredrik Mörk Jun 10 '10 at 05:14
2

If you want to match any Latin character with an accent or diacritic mark in virtually any regular expressions engine, try:

[A-Za-zŽžÀ-ÿ]

It matches any of the following characters from the printable ASCII and "extended ASCII" (Windows-1252) sets:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ŽžÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

Matches {char} (Windows-1252 character code, case sensitive):

char(s) index(start) index(end)
[A-Z] 65 90
[a-z] 97 122
Ž 142 ---
ž 158 ---
[À-ÿ] 192 255

Test it at https://regex101.com/r/Xbbtm1/1
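A quick check of this class in JavaScript, as a sketch only. JavaScript regexes compare Unicode code points, so the class is written here with escapes: `\u017D\u017E` for Žž and `\u00C0-\u00FF` for À-ÿ. That range also contains the signs × (U+00D7) and ÷ (U+00F7), so they match too, while letters outside the class, such as ł (U+0142), do not. The helper name `matchesLatinClass` is hypothetical.

```javascript
// Same class as [A-Za-zŽžÀ-ÿ], spelled with escapes so the source file
// encoding cannot change its meaning.
const latinClass = /^[A-Za-z\u017D\u017E\u00C0-\u00FF]+$/;

// Hypothetical helper: true when every character is in the class.
function matchesLatinClass(text) {
  return latinClass.test(text);
}

console.log(matchesLatinClass("\u00E9\u00E8\u00EA\u00EB")); // true  (éèêë)
console.log(matchesLatinClass("\u017D\u017E"));             // true  (Žž)
console.log(matchesLatinClass("\u00D7"));                   // true  (×, inside À-ÿ but not a letter)
console.log(matchesLatinClass("\u0142"));                   // false (ł, outside the class)
```

If × and ÷ should be rejected, one option is to split the range into `\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF`.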

ssent1
1

\p{L} isn't cross-browser yet. Transpiling down from this will give you massively bloated code if you use it a lot.

Here is a short and sweet answer to generally including non-ascii letters that doesn't add a gazillion lines of JavaScript or plugins. Replace a-zA-Z0-9 or \w in your regex with this, and don't use the u flag:

\u00BF-\u1FFF\u2C00-\uD7FF\w

Inserted into all my JavaScript regexes in place of a-zA-Z0-9 or \w, this seems to do the job. My context was recognizing UTF-8 text in HTML and CSS, and it had to work cross-browser.

I can't believe it is this simple, so I am waiting to be proved wrong after a day of searching for something that works in Firefox...

I've only tested this using Japanese hiragana with a French accent.
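A minimal sketch of how those ranges might be used, assuming they are wrapped in a character class and the regex is written without the `u` flag. The name `wordLike` is made up for illustration. The last line shows the trade-off: the ranges are broad enough to accept some non-letter characters as well.

```javascript
// The ranges from above wrapped in a character class; no `u` flag.
const wordLike = /^[\u00BF-\u1FFF\u2C00-\uD7FF\w]+$/;

console.log(wordLike.test("caf\u00E9_1")); // true  (é is U+00E9; _ and 1 come from \w)
console.log(wordLike.test("ひらがな"));     // true  (hiragana lies in U+2C00-U+D7FF)
console.log(wordLike.test("#id"));         // false (# is plain ASCII punctuation)
console.log(wordLike.test("\u00D7"));      // true  (× is accepted although it is not a letter)
```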

bob2517
  • Douglas Crockford has a similar letter class `[A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]` in his book JavaScript: The Good Parts. It includes all Unicode letters, but also thousands of characters that are not. An exact letter class of the BMP would be very large and inefficient. – yas Jan 20 '20 at 19:53
  • @dave Cool. Yeah, basically my context is to discern everything that isn't a key character in HTML or CSS, for manual parsing purposes. So anything that looks odd I consider to be content and not structure. For example, if someone wants to do #(insert hiragana here) { (insert hiragana)="(insert hiragana)"; }, then that's allowed and fine. I just need to detect the #, {, }, =, " and ; so I know what's going on, and that's just plain ASCII and not covered in the regex. This seems to work in my case. Whether it works or not as CSS is not a problem in my specific case - it just needs to parse OK. – bob2517 Jan 21 '20 at 17:51
0

[e\xE8\xE9\xEA\xEB] will match any one of eéèêë
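As a sketch in JavaScript, where `\xHH` inside a regex denotes the character U+00HH, so `\xE8` to `\xEB` are è, é, ê and ë. Other regex flavors may interpret `\xHH` against the active code page instead. The sample text and the name `eVariants` are made up for illustration.

```javascript
// Matches a plain e or one of its four accented lowercase variants.
const eVariants = /[e\xE8\xE9\xEA\xEB]/g;

// Collect every match from a sample string.
const found = "élève, être, Noël".match(eVariants);

console.log(found.join("")); // éèeêeë
```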

dlras2
  • What character encoding are you referring to? – Gumbo Jun 09 '10 at 21:25
  • Extended ASCII. Good catch. Should be encoded for ASCII/ANSI (according to http://www.regular-expressions.info/reference.html.) (Though it looks like `\p{L}` is still a better option.) – dlras2 Jun 09 '10 at 22:39
  • Extended ASCII is not a character set that I'm aware of. This matches up with at least Windows-1252 (ew) and ISO-8859-1. – Thanatos Jun 09 '10 at 23:42
  • http://www.asciitable.com/ I guess that's not the official name for it. It's what I run into most, tho. – dlras2 Jun 10 '10 at 04:25
  • There is no character set/encoding named Extended ASCII; it’s just a term for character sets/encodings that have US-ASCII as its base (see http://en.wikipedia.org/wiki/Extended_ASCII). I think the one you are referring to is the code page 437 (see http://en.wikipedia.org/wiki/Code_page_437). – Gumbo Jun 10 '10 at 08:52