How to select rows which are in russian?

Question

I have a table with messages, something like this

id   message
1     hi
2     ok
3     துய
4     нет

In result I want to get only the fourth row, since the first 3 ones are written in different language

a form of solution is mentioned here https://stackoverflow.com/questions/8956279/language-detection-with-data-in-postgresql — VynlJunkie, Aug 24 '21 at 12:54

Laurenz Albe · Accepted Answer · 2021-08-24T15:38:04.673

1

You could check for certain ranges of characters.

For example, this query checks if the string contains a Cyrillic letter:

SELECT id FROM mytable
WHERE message ~ '[АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя]';

edited Aug 24 '21 at 15:38

answered Aug 24 '21 at 13:18

Laurenz Albe

209,280
17
206
263

I get the following mistakes: 1.Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at or near "'[\u0400" 2.ERROR: character 0xd080 of encoding "UTF8" has no equivalent in "WIN1251" – user458 Aug 24 '21 at 15:08
Then you will have to enumerate the letters. You should use `UTF8`, however. – Laurenz Albe Aug 24 '21 at 15:38

Pierre D · Answer 2 · 2021-08-24T15:20:09.877

You may try some crude language detection via letter unigrams. This is extensible to cases where character ranges cannot be used to neatly disambiguate languages (e.g.: Latin languages).

Here is a working gist of this.

Basically:

create table charfreq (
    c text,
    f float,
    lang text
    );
insert into charfreq values
 ('а',  8.04,  'ru'),
 ('б',  1.55,  'ru'),
 ('в',  4.75,  'ru'),
...
 ('a',  8.167, 'en'),
 ('b',  1.492, 'en'),
 ('c',  2.782, 'en'),
...
 ('ï',  0.005, 'fr'),
 ('ô',  0.023, 'fr'),
 ('ù',  0.058, 'fr'),
 ('û',  0.06 , 'fr');

Then, assuming, e.g.

...
insert into test values
 (1, 'hi'),
 (2, 'ok'),
 (3, 'துய'),
 (4, 'нет'),
 (5, 'été');  -- a French message, just for fun

You can use:

select id, message, lang, score
from (
  select *, row_number() OVER (partition by id, message order by score desc) as rownum
  from (
    select id, message, lang, coalesce(sum(f), 0)/length(message) as score
    from (
      select *, unnest(STRING_TO_ARRAY(message, NULL)) as c from test
    ) as a
    left join charfreq b
    using (c)
    group by 1,2,3
  ) as a
) as a
where rownum = 1;

Outcome on the test table above:

id	message	lang	score
1	hi	en	6.53
2	ok	en	4.1395
3	துய	null	0
4	нет	ru	6.92
5	été	fr	3.41733333333333

Notes

The above is quite crude. A better approach is to use a proper unigram language model. In this other gist, we use such a model (using sum(log(P)) instead of prod(P)), and a default probability (chars not matching) arbitrarily set to 0.01%. In that gist, notice how 'les aventures' is correctly identified as French, whereas 'the adventures' is (also correctly) decided to be English, even though the same [a-z] range is used.

References

Source for English and French letters unigram frequencies.
Source for Russian letter unigram frequencies.

Just select the rows `where lang = 'ru'`. This is additive to other answers because it can be generalized to other languages as well. — Pierre D, Aug 24 '21 at 15:15

OMANSAK · Answer 3 · 2021-08-24T16:12:17.600

Table

CREATE TABLE MESSAGE (
   Id INTEGER PRIMARY KEY,
   MESSAGE VARCHAR (50)
);

Values

INSERT INTO MESSAGE VALUES (1, 'hi');//False
INSERT INTO MESSAGE VALUES (2, 'ok');//False
INSERT INTO MESSAGE VALUES (3, 'துய');//False
INSERT INTO MESSAGE VALUES (4, 'нет');//True
INSERT INTO MESSAGE VALUES (5, 'нет-_*/?/()=.,123  ');//True
INSERT INTO MESSAGE VALUES (6, 'нет  123');//True
INSERT INTO MESSAGE VALUES (6, 'нет  123AAAA');//False

Query

SELECT * FROM message m
WHERE 
ARRAY(SELECT ASCII(unnest(STRING_TO_ARRAY(REGEXP_REPLACE(M.message, '[^[:alnum:]]+', '', 'g'), NULL)))) <@ 
ARRAY(SELECT ASCII(unnest(STRING_TO_ARRAY('АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя0123456789', NULL))))

Cyrillic alphabets : https://en.wikipedia.org/wiki/Cyrillic_alphabets
Postgre SQL Operator : https://popsql.com/learn-sql/postgresql/how-to-compare-arrays-in-postgresql
String to ASCII : PostgreSQL - Convert string to ASCII integer
Array Function & Operators : https://www.postgresql.org/docs/12/functions-array.html

with `REGEXP_REPLACE(M.message, '[^[:alnum:]]+', '', 'g')` . I updated my answer — OMANSAK, Aug 24 '21 at 16:03

How to select rows which are in russian?

3 Answers3