How to identify a language in utf-8 column in MySQL

Question

My question is how to find specific character set from utf-8 column in MySQL server?

Please note that this is NOT Duplicate question, please read carefully what's asked, not what's you think.

Currently MySQL does works perfectly with utf-8 and shows all types of different languages and I don't have any problem to see different languages in database. I use SQLyog to connect MySQL server and all SELECT results are perfect, I can see Cyrillic, Japanese, chinese, Turkish, French or Italian or Arabic or any types of languages are mixed and shows perfectly. As well my.ini and scripts also perfectly configured and working well.

Here How can I find non-ASCII characters in MySQL? I see that some people answers the question and their answers also perfect to find non ASCII text. but my question is similar, but little different. I want to find specific character set from utf-8 column in MySQL server.

let's say,

select * from TABLE where COLUMN regexp '[^ -~]';

it returns all non ASCII characters including Cyrillic, Japanese, chinese, Turkish, French or Italian or Arabic or any types of languages. but I want is

SELECT * from TABLE WHERE COLUMN like or regexp'Japanese text only?'

another words, I want SELECT only Japanese encoded text. currently I can see all types of language with this;

select * from TABLE where COLUMN regexp '[^ -~]';

but I want select only japanese or russian or arabic or french language. how to do that?

Database contains all languages mixed rows and UTF-8. I am not sure is it possible in MySQL Server? if not possible, then how to do this?

Thanks a lot!

Possible duplicate of [Does MySQL Regexp support Unicode matching](http://stackoverflow.com/questions/14356248/does-mysql-regexp-support-unicode-matching) — Ruslan Osmanov, May 06 '16 at 03:43
It's not duplicate, read before saying it's duplicate, know what's asked and what are you showing. — Mem, May 06 '16 at 03:49

Rick James · Answer 1 · 2016-05-07T03:47:42.483

Well, let's start with a table I put in here. It says, for example, that E381yy is the utf8 encoding for Hiragana and E383yy is Katakana (Japanese). (Kanji is another matter.)

To see if a utf8 column contains Katakana, do something like

WHERE HEX(col) REGEXP '^(..)*E383'

Cyrillic might be

WHERE HEX(col) REGEXP '^(..)*D[0-4]'

Chinese is a bit tricky, but this might usually work for Chinese (and Kanji?):

WHERE HEX(col) REGEXP '^(..)*E[4-9A]'

(I'm going to change your Title to avoid the keyword 'character set'.)

Western Europe (including, but not limited to, French) C[23], Turkish (approx, and some others) (C4|C59), Greek: C[EF], Hebrew: D[67], Indian, etc: E0, Arabic/Farsi/Persian/Urdu: D[89AB]. (Always prefix with ^(..)*.

You may notice that these are not necessarily very specific. This is because of overlaps. British English and American English cannot be distinguished except by spelling of a few words. Several accented letters are shared in various ways in Europe. India has many different character sets: Devanagari, Bengali, Gurmukhi, Gujarati, etc.; these are probably distinguishable, but it would take more research. I think Arabic/Farsi/Persian/Urdu share one character set.

Some more:

| SAMARITAN                     | E0A080        | E0A0BE        |
| DEVANAGARI                    | E0A480        | E0A5BF        |
| BENGALI                       | E0A681        | E0A7BB        |
| GURMUKHI                      | E0A881        | E0A9B5        |
| GUJARATI                      | E0AA81        | E0ABB1        |
| ORIYA                         | E0AC81        | E0ADB1        |
| TAMIL                         | E0AE82        | E0AFBA        |
| TELUGU                        | E0B081        | E0B1BF        |
| KANNADA                       | E0B282        | E0B3B2        |
| MALAYALAM                     | E0B482        | E0B5BF        |
| SINHALA                       | E0B682        | E0B7B4        |
| THAI                          | E0B881        | E0B99B        |
| LAO                           | E0BA81        | E0BB9D        |
| TIBETAN                       | E0BC80        | E0BF94        |

So, for DEVANAGARI, '^(..)*E0A[45]'

Woooow!!! Thanks a lot! I was thinking it's not possible. How about French, Turkish, Greek, Hebrew , Indian, Arabic, or Urdu? — Mem, May 07 '16 at 00:46
@Crusader - `utf8mb4` acts the same as `utf8` (at least for this discussion). `utf8mb4` is needed for Emoji. There are several Emoji ranges, and most involve `F0` in the regexp. — Rick James, Mar 04 '19 at 00:21
from where you get above character sets for all different languages? by DEVANAGARI, we can get hindi and marathi both data but what to do if I only want hindi and Marathi language data seperatly @rick-james — mansi, Mar 23 '22 at 10:57
@mansi - From Quora: "Both languages use Devnagari script. But the grammar is different. There are some characters and pronunciations that are different. The way the verbs take form is different. The way the prefixes and suffixes are added is different." A somewhat similar thing happens with European character sets versus languages. — Rick James, Mar 23 '22 at 16:04
@mansi - I forget where I got the info; presumably it was an official place describing UTF-8 and Unicode. As I interpret the above quote, it would be either difficult or impossible to distinguish between Hindi Marathi _texts_ as encoded in UTF-8 (MySQL's "utf8" or "utf8mb4"). — Rick James, Mar 23 '22 at 16:07

How to identify a language in utf-8 column in MySQL

1 Answers1

Linked