If I understood your question correctly, you want to check whether a given string contains Chinese/Japanese characters or alphabetical characters, but not punctuation or emoji?
For the Asian characters you can use the CJK ranges from Unicode, which might be close enough. You can always check more ranges for languages you want to (dis)allow.
So the first step would be to check if a given code point is in the CJK range(s):
def is_in_range?(cp) do
  # CJK Unified Ideographs blocks and extensions, plus U+3007 (ideographic number zero).
  # Note: code points above U+FFFF need the \u{...} form in Elixir.
  ranges = [
    {"\u4E00", "\u9FEF"},
    {"\u3400", "\u4DBF"},
    {"\u{20000}", "\u{2A6DF}"},
    {"\u{2A700}", "\u{2B73F}"},
    {"\u{2B740}", "\u{2B81F}"},
    {"\u{2B820}", "\u{2CEAF}"},
    {"\u{2CEB0}", "\u{2EBEF}"},
    {"\u3007", "\u3007"}
  ]

  # Check if the code point falls inside any of the ranges above.
  ranges
  |> Enum.map(fn {s, e} -> cp >= s and cp <= e end)
  |> Enum.any?()
end
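As a quick sanity check in iex (assuming the function lives in a module called, say, CJK; that name is just for illustration):

iex> CJK.is_in_range?("漢")
true
iex> CJK.is_in_range?("a")
false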
With that function in place, we can check whether a given string contains any of these characters:
def contains_cjk(str) do
  str |> String.codepoints() |> Enum.map(&is_in_range?/1) |> Enum.any?()
end
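Trying it out (again assuming the hypothetical CJK module name):

iex> CJK.contains_cjk("hello")
false
iex> CJK.contains_cjk("漢字")
true
iex> CJK.contains_cjk("こんにちは")
false

The last call returns false because Hiragana is not one of the CJK ideograph blocks listed above, which brings us to the next point.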
If you want to check for alpha characters you can use a regular regex, or just add the ranges for a-z and A-Z (\u0061 to \u007A, and \u0041 to \u005A). For example, the first code point of your second string (こんにちは) is in the "Hiragana" code block, so you could add that range (\u3040 to \u309F) to also allow these characters. A listing of blocks can be found here.
A note on performance is in place here: this code does more than one comparison per character, since for n characters it performs roughly n × (number of ranges) comparisons, and the Enum.map/2 step walks the whole string even when a match is found early.
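If that ever becomes a concern, one way to cut the work down (a sketch under the same range assumptions, with an invented module name) is to compare integer code points against precomputed ranges and let Enum.any?/2 short-circuit on the first hit:

defmodule CJKFast do
  # Same blocks as above, as integer ranges; `cp in range` is a cheap bounds check.
  @cjk_ranges [
    0x4E00..0x9FEF,
    0x3400..0x4DBF,
    0x20000..0x2A6DF,
    0x2A700..0x2B73F,
    0x2B740..0x2B81F,
    0x2B820..0x2CEAF,
    0x2CEB0..0x2EBEF,
    0x3007..0x3007
  ]

  def contains_cjk?(str) do
    str
    |> String.to_charlist()
    |> Enum.any?(fn cp -> Enum.any?(@cjk_ranges, &(cp in &1)) end)
  end
end

This still scans every character in the worst case, but it stops as soon as a CJK character is found and avoids building an intermediate list of booleans.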