I need a JavaScript regex that matches words in any language but fails on emoji or any other character. The solution in "Regular expression to match non-English characters?" matches all letters plus pictograms and emoji: ([^\u0000-\u007F]+).
Modifying it a bit seems to accomplish what I need, but I'm not sure how safe it is: ([a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+
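One way to probe how safe the modified pattern is would be to anchor it and run it over a handful of inputs. This is only a rough check under my own assumptions: the `accepts` helper and the sample strings are made up for illustration, and a few samples is nowhere near exhaustive.

```javascript
// The modified pattern from above, anchored so it has to cover the whole input.
// No "u" flag, so it works on UTF-16 code units, as the original does.
const candidate = /^(?:[a-zA-Z]|[^\u0000-\u007F\u200d-\u3299\ud83c-\udfff\ufe0e\ufe0f])+$/;

// Hypothetical helper: does the pattern accept the entire string?
const accepts = (s) => candidate.test(s);

accepts("America");                  // true
accepts("Österreich");               // true
accepts("Россия");                   // true
accepts("\u{1F600}");                // false (grinning face, both surrogates excluded)
accepts("1\uFE0F\u20E3");            // false (keycap sequence for 1)
accepts("\u65E5\u672C\u8A9E");       // true  (CJK ideographs)
accepts("\u3072\u3089\u304C\u306A"); // false (hiragana falls inside \u200d-\u3299)
accepts("\u00A9");                   // true  (copyright sign, not a letter)
```

So at least two things look unsafe: the excluded range \u200d-\u3299 also throws out kana, and anything non-ASCII outside the excluded ranges is accepted whether or not it is a letter.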
Example:
America
Österreich
Россия
Ελλάδα
It should match only letters and stop before any emoji. It should also not match emoji that are built on a plain-text character, for example: 1️⃣#️⃣*️⃣
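For comparison, the requirement above could also be expressed with Unicode property escapes instead of hand-picked ranges. This is a minimal sketch that assumes a runtime with ES2018 support for \p{...} and the "u" flag; `findWords` is a hypothetical helper name. The pattern requires a letter first, because the variation selector U+FE0F and the keycap U+20E3 are themselves combining marks and would otherwise match on their own.

```javascript
// A letter (\p{L}) followed by any further letters or combining marks (\p{M}),
// so decomposed accents stay attached to their base letter.
const wordRun = /\p{L}[\p{L}\p{M}]*/gu;

// Hypothetical helper: return every run of letters found in the text.
function findWords(text) {
  return text.match(wordRun) || [];
}

findWords("America Österreich Россия Ελλάδα");
// ["America", "Österreich", "Россия", "Ελλάδα"]

findWords("hello\u{1F600}world");
// ["hello", "world"]

findWords("1\uFE0F\u20E3#\uFE0F\u20E3*\uFE0F\u20E3");
// []
```

This is not a perfect separation either: U+2139 (INFORMATION SOURCE) has an emoji presentation but is categorized as a lowercase letter, so it would still be matched.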
Relevant: http://www.unicode.org/Public/emoji/5.0/emoji-variation-sequences.txt
Bit of context:
I'm trying to patch this parser: https://github.com/Khan/simple-markdown/blob/master/simple-markdown.js#L1304 to break on emoji, because currently it matches as much text as it can. Without that, matching or replacing emoji via the parser is problematic. Removing \u00c0-\uffff from the highlighted regex accomplishes what I need, but the parser then starts breaking up words. Text in some scripts (Cyrillic, for example) gets broken up letter by letter, which is bad for performance. I need to either patch that regex to allow letters but not emoji, or put a regex before it that catches all text.
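To illustrate the second option, here is a toy tokenizer in which a letters-only rule is tried first against the remaining input and a one-character fallback handles everything else. It is a hypothetical stand-in, not simple-markdown's actual code or API: `wordRule` and `tokenize` are names I made up, and it assumes ES2018 Unicode property escapes are available.

```javascript
// Anchored rule: a run of letters (plus combining marks) at the start of the input.
const wordRule = /^\p{L}[\p{L}\p{M}]*/u;

// Hypothetical tokenizer: whole words become one token, anything else is
// consumed one code point at a time.
function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    const m = wordRule.exec(source.slice(pos));
    if (m) {
      tokens.push({ type: "text", value: m[0] });
      pos += m[0].length;
    } else {
      // Take a full code point, not a code unit, so surrogate pairs stay intact.
      const ch = String.fromCodePoint(source.codePointAt(pos));
      tokens.push({ type: "other", value: ch });
      pos += ch.length;
    }
  }
  return tokens;
}

tokenize("Россия\u{1F600}");
// [ { type: "text", value: "Россия" }, { type: "other", value: "\u{1F600}" } ]
```

With a rule like this in front, a Cyrillic word comes out as a single token rather than one token per letter, and the emoji is left for a later rule to handle.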
Edit: Added some examples
Edit: Added language restriction