I need a JavaScript regex that matches words in any language but fails on emoji or any other character. The solution in "Regular expression to match non-English characters?" matches all letters, but also pictograms and emoji: ([^\u0000-\u007F]+).

Modifying it a bit seems to accomplish what I need, but I'm not sure how safe it is: ([a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+

Example: America Österreich Россия Ελλάδα

Should only match letters and stop before emoji. Should not match emojis with letter representations, for example: 1️⃣#️⃣*️⃣

Relevant: http://www.unicode.org/Public/emoji/5.0/emoji-variation-sequences.txt

Bit of context: I'm trying to patch this parser: https://github.com/Khan/simple-markdown/blob/master/simple-markdown.js#L1304 to break on emoji, because currently it matches as much text as it can. Without that, matching/replacing emoji via that parser is problematic. Removing \u00c0-\uffff from the highlighted regex accomplishes what I need, but the parser then starts breaking up words. Some scripts (e.g. Cyrillic) get broken up letter by letter, which is bad for performance. I need to either patch that regex to allow letters but not emoji, or add a regex that catches all text before it.

Edit: Added some examples

Edit: Added language restriction

Max

3 Answers

I found a solution here: https://mathiasbynens.be/notes/es-unicode-property-escapes#word

Essentially /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/u given Unicode property escapes support.

Until \p is natively supported in JavaScript, you can transpile this regex.
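
As a rough sketch of how the pattern above behaves in an engine with native `\p{…}` support: the helper names `extractWords` and `extractLetterWords` are hypothetical, and the stricter second pattern is an assumption of mine, not part of the linked note. One thing worth noticing is that keycap sequences such as 1️⃣ are built from a digit plus two combining marks (U+FE0F, U+20E3), so the first pattern still matches them.

```javascript
// Pattern from the answer: letters, marks, digits, connectors and join controls.
const wordRun = /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/gu;

// Assumed stricter variant: a run must start with a letter and may continue
// with letters or combining marks. It deliberately drops digits and underscores.
const letterRun = /\p{L}[\p{L}\p{M}]*/gu;

// Hypothetical helper: every run of "word" characters in the input.
function extractWords(text) {
  return text.match(wordRun) || [];
}

// Hypothetical helper: every run that begins with a letter.
function extractLetterWords(text) {
  return text.match(letterRun) || [];
}

console.log(extractWords('America Österreich Россия Ελλάδα \u{1F600}'));
// → [ 'America', 'Österreich', 'Россия', 'Ελλάδα' ]

const keycap = '1\uFE0F\u20E3'; // the keycap emoji 1️⃣
console.log(extractWords(keycap + ' abc').length);  // → 2 (the keycap is matched too)
console.log(extractLetterWords(keycap + ' abc'));   // → [ 'abc' ]
```

Whether the stricter variant is appropriate depends on whether digits and underscores should count as part of a word in your parser.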

Mathias Bynens
Max

\pL matches a Unicode letter.

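You might want to combine that Unicode category with \p{Pc} (connector punctuation, such as the underscore) by using a character class: [\pL\p{Pc}]. Note that the apostrophe in words like it's or doesn't is not connector punctuation, so add ' to the class explicitly if those should match as single words.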

Tim Pietzcker
  • Thank you, I forgot to add in the text that I need it in javascript. Your solution would be fine, but not for JS :( – Max Jun 28 '17 at 10:17
  • Ah, sorry, I somehow thought I had read Java... In that case, grab Steve Levithan's XRegExp library (with Unicode plugins): http://xregexp.com/plugins/ – Tim Pietzcker Jun 28 '17 at 10:30

Before JavaScript gained Unicode property escapes in ES2018 (supported natively by many browsers as of mid-2020), the answer is "roll your own".

Here is what I made, after consulting Wikipedia and using this SO answer for cleaning up the endless list of Unicode code points:

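const westernEurope = '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BF';
// (\u00D7 and \u00F7 are the math symbols × and ÷, so they are skipped)
const cyrillic = '\u0400-\u04FF';
const japan = '\u30A0-\u30FF'; // Katakana only; Hiragana would be \u3040-\u309F
const chinese = '\u4E00-\u9FA5';

// No 'g' flag: with ^...$ it adds nothing, and it makes test() stateful via lastIndex.
const re = new RegExp(`^[a-zA-Z${westernEurope + cyrillic + japan + chinese}]*$`);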

You should also consult Wikipedia if you need other languages or want to double-check this (for instance, I only included the basic Cyrillic block in the cyrillic ranges above).
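
To illustrate how such hand-rolled ranges might be used and extended, here is a small sketch. The `greek` range (the Greek and Coptic block) is my own addition to cover the question's Ελλάδα example, and the names `letters`, `wholeWord` and `wordRuns` are hypothetical. Like the Cyrillic block, whole blocks include a few non-letter code points, so treat this as an approximation.

```javascript
// Ranges from the answer above. All are in the BMP, so no 'u' flag is needed.
const westernEurope = '\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01BF';
const cyrillic = '\u0400-\u04FF';
const japan = '\u30A0-\u30FF';
const chinese = '\u4E00-\u9FA5';
// Assumption: Greek and Coptic block, added for the question's Greek example.
const greek = '\u0370-\u03FF';

const letters = `a-zA-Z${westernEurope + cyrillic + japan + chinese + greek}`;

// Anchored: does the whole string consist of listed letters?
const wholeWord = new RegExp(`^[${letters}]+$`);
// Unanchored with 'g': extract each run of listed letters.
const wordRuns = new RegExp(`[${letters}]+`, 'g');

console.log(wholeWord.test('Россия'));            // → true
console.log(wholeWord.test('Россия\u{1F600}'));   // → false (emoji at the end)
console.log('America Österreich Россия Ελλάδα \u{1F600}'.match(wordRuns));
// → [ 'America', 'Österreich', 'Россия', 'Ελλάδα' ]
```

Without the `greek` range, Ελλάδα would not be matched at all, which is the main maintenance cost of this approach.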

If you can use the latest JavaScript in your project, this answer explains how Unicode property escapes are just what we need.

Mad Bernard