How to detect non-language characters?

Question

My title may be unclear, so let me describe what I need. Each character in the string:

Can be from Chinese, Japanese, or any other language, such as 你好 or こんにちは
Can be an English letter, A-Z or a-z
Can't be a symbol, such as ! or ,
Can't be a special character such as an emoji or another symbol

Can this be determined from the byte size of an Elixir binary, or from Unicode?

In POSIX compliant regexp there is a `:punct:` label - see the table here https://www.regular-expressions.info/posixbrackets.html but I doubt it is implemented in Elixir yet. See also https://stackoverflow.com/questions/6692540/how-can-i-create-an-alphanumeric-regex-for-all-languages — GavinBrelstaff, Mar 07 '19 at 08:05

Christophe De Troyer · Accepted Answer · 2019-03-08T12:43:39.033

If I understood your question correctly, you want to check whether a given string contains Chinese/Japanese characters or alphabetical characters, but not punctuation or emoji?

For the Asian characters you can use the CJK ranges from Unicode, which might be close enough. You can always check more ranges for the languages you want to (dis)allow.

So the first step would be to check if a given code point is in the CJK range(s):

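  def is_in_range?(cp) do
    # Each pair is {first, last} of a block, written as one-character strings.
    # Code points above U+FFFF need the braced form \u{...}; "\u20000" would
    # be read as "\u2000" followed by the character "0".
    ranges = [
      {"\u4E00", "\u9FEF"},
      {"\u3400", "\u4DBF"},
      {"\u{20000}", "\u{2A6DF}"},
      {"\u{2A700}", "\u{2B73F}"},
      {"\u{2B740}", "\u{2B81F}"},
      {"\u{2B820}", "\u{2CEAF}"},
      {"\u{2CEB0}", "\u{2EBEF}"},
      {"\u3007", "\u3007"}
    ]

    # Single-code-point UTF-8 strings compare in code point order, so this
    # checks whether cp lies inside any of the ranges above.
    Enum.any?(ranges, fn {s, e} -> cp >= s and cp <= e end)
  end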

If we have that function, we can check for any given string if it contains any of these characters:

  def contains_cjk(str) do
    # Enum.any?/2 stops at the first code point that matches.
    str |> String.codepoints() |> Enum.any?(&is_in_range?/1)
  end
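
The two functions above can also be put together in one module that compares integer code points instead of one-character strings. This is only a minimal sketch: the module name `CJKCheck` and the function names `in_range?/1` and `contains_cjk?/1` are made up for illustration, and the ranges are the ones listed above.

```elixir
defmodule CJKCheck do
  # Hypothetical module name. The ranges are copied from the answer,
  # written as integer code points.
  @ranges [
    0x4E00..0x9FEF,
    0x3400..0x4DBF,
    0x20000..0x2A6DF,
    0x2A700..0x2B73F,
    0x2B740..0x2B81F,
    0x2B820..0x2CEAF,
    0x2CEB0..0x2EBEF,
    0x3007..0x3007
  ]

  # True if the integer code point falls inside any of the ranges.
  def in_range?(cp) when is_integer(cp) do
    Enum.any?(@ranges, fn range -> cp in range end)
  end

  # True if at least one character of the string is in one of the ranges.
  def contains_cjk?(str) when is_binary(str) do
    str
    |> String.to_charlist()
    |> Enum.any?(&in_range?/1)
  end
end

IO.inspect(CJKCheck.contains_cjk?("你好"))       # true
IO.inspect(CJKCheck.contains_cjk?("hello"))      # false
IO.inspect(CJKCheck.contains_cjk?("こんにちは")) # false, Hiragana is not in the list
```

The last call returns false because none of the listed ranges covers Hiragana.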

If you also want to match alphabetic characters, you can use a regular expression, or just add the ranges for A-Z (\u0041 to \u005A) and a-z (\u0061 to \u007A). Similarly, the first code point of your second string (こんにちは) is in the "Hiragana" block, so you could add the range \u3040 to \u309F to allow those characters as well. A listing of blocks can be found here.
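
As for the regular expression route, one possible reading (an assumption, not something spelled out above) is to use the Unicode "Letter" property `\p{L}` rather than listing blocks by hand. The sketch below treats "language character" as "any Unicode letter"; the module name `LetterCheck` and the function `letters_only?/1` are hypothetical.

```elixir
defmodule LetterCheck do
  # Hypothetical helper: true when the string is non-empty and every
  # character is in the Unicode "Letter" category (\p{L}).
  # The `u` modifier enables Unicode support; \A and \z anchor the
  # match to the very start and end of the string.
  def letters_only?(str) when is_binary(str) do
    Regex.match?(~r/\A\p{L}+\z/u, str)
  end
end

IO.inspect(LetterCheck.letters_only?("你好"))       # true
IO.inspect(LetterCheck.letters_only?("こんにちは")) # true
IO.inspect(LetterCheck.letters_only?("Hello"))      # true
IO.inspect(LetterCheck.letters_only?("Hello!"))     # false, punctuation
IO.inspect(LetterCheck.letters_only?("😀"))         # false, emoji
```

Scripts that rely on combining marks may need `\p{M}` allowed as well, since those marks are not in the Letter category.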

A note on performance is in order here. For a string of n characters, this code performs up to n × (number of ranges) comparisons, so every range you add makes each check slower.
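
If that cost matters, one option is to walk the binary one code point at a time and stop at the first match, with the ranges expressed as a guard. This is a sketch under assumptions: the module `CJKScan` and the guard `is_cjk/1` are hypothetical names, and the ranges are the same ones used above.

```elixir
defmodule CJKScan do
  # Hypothetical private guard covering the same ranges as above.
  # `in` with a literal range is allowed in guards.
  defguardp is_cjk(cp)
            when cp in 0x4E00..0x9FEF or cp in 0x3400..0x4DBF or
                   cp in 0x20000..0x2A6DF or cp in 0x2A700..0x2B73F or
                   cp in 0x2B740..0x2B81F or cp in 0x2B820..0x2CEAF or
                   cp in 0x2CEB0..0x2EBEF or cp == 0x3007

  # First code point matches: stop immediately.
  def contains_cjk?(<<cp::utf8, _rest::binary>>) when is_cjk(cp), do: true
  # First code point does not match: continue with the rest.
  def contains_cjk?(<<_cp::utf8, rest::binary>>), do: contains_cjk?(rest)
  # Empty binary (or bytes that are not valid UTF-8): nothing found.
  def contains_cjk?(_other), do: false
end

IO.inspect(CJKScan.contains_cjk?("abc你"))  # true
IO.inspect(CJKScan.contains_cjk?("abc"))    # false
```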
