I'm trying to detect whether a text contains characters belonging to the writing system of a language without word boundaries. According to Wikipedia, these languages are the following (I have added an ISO 639 language code or an ISO 15924 script code to each):
Burmese MY
Chinese ZH
Japanese JA
S'gaw Karen KAR
Khmer KM
Lao LO
ʼPhags-pa PHAG
Pwo Karen PWO
Tai Tham LANA
Thai TH
Tibetan BO
In the case of Chinese, I'm using a specific regex for the Han writing system:
HAN_REGEX = /[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FD5\uF900-\uFA6D\uFA70-\uFAD9]/;
as an equivalent to \p{Han}. An alternative for Chinese characters is to use a Unicode property escape directly:
let regexp = /\p{sc=Han}/gu;
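For detection, the property-escape form might be wrapped as in the minimal sketch below. It assumes a runtime with ES2018 Unicode property escapes; the names `HAN_SCRIPT_REGEX` and `containsHan` are hypothetical. The sketch omits the `g` flag, because `test()` on a global regex keeps state in `lastIndex` between calls and can give alternating results on repeated checks.

```javascript
// Hypothetical helper: true if the text contains at least one Han character.
// No `g` flag: a global regex would carry `lastIndex` across test() calls.
const HAN_SCRIPT_REGEX = /\p{sc=Han}/u;

function containsHan(text) {
  return HAN_SCRIPT_REGEX.test(text);
}

console.log(containsHan("漢字 and Latin")); // true
console.log(containsHan("Latin only"));     // false
```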
Similarly, going by the Kanji Unicode table, the character range to detect JA in the text is this one:
KANJI_REGEX = /[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]/
But what about the other writing systems? Is a hard-coded character range the only way?
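One possible alternative to hand-maintained ranges is to match by Unicode script name. The sketch below assumes a runtime with ES2018 Unicode property escapes (the `u` flag). The script names are Unicode's own, but the choice of which scripts to cover and the helper `detectScripts` are illustrative assumptions. S'gaw and Pwo Karen are commonly written in the Myanmar script, so script detection alone would not separate them from Burmese.

```javascript
// Unicode script names assumed to cover the languages listed above.
// Japanese is represented by Han plus Hiragana and Katakana.
const SCRIPT_NAMES = [
  "Myanmar", "Han", "Hiragana", "Katakana", "Khmer",
  "Lao", "Phags_Pa", "Tai_Tham", "Thai", "Tibetan",
];

// Build one non-global regex per script, e.g. /\p{Script=Thai}/u.
const SCRIPT_REGEXES = new Map(
  SCRIPT_NAMES.map((name) => [name, new RegExp(`\\p{Script=${name}}`, "u")])
);

// Hypothetical helper: names of all listed scripts that occur in the text.
function detectScripts(text) {
  return SCRIPT_NAMES.filter((name) => SCRIPT_REGEXES.get(name).test(text));
}

console.log(detectScripts("สวัสดี hello")); // [ 'Thai' ]
console.log(detectScripts("漢字かな"));      // [ 'Han', 'Hiragana' ]
```

Unlike KANJI_REGEX, this does not match most CJK punctuation or fullwidth Latin letters, since those are not assigned to the Han, Hiragana, or Katakana scripts.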