1

I have a table with messages, something like this

id   message
1     hi
2     ok
3     துய
4     нет

In result I want to get only the fourth row, since the first 3 ones are written in different language

user458
  • 369
  • 1
  • 9

3 Answers3

1

You could check for certain ranges of characters.

For example, this query checks if the string contains a Cyrillic letter:

SELECT id FROM mytable
WHERE message ~ '[АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя]';
Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • I get the following mistakes: 1.Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8 at or near "'[\u0400" 2.ERROR: character 0xd080 of encoding "UTF8" has no equivalent in "WIN1251" – user458 Aug 24 '21 at 15:08
  • Then you will have to enumerate the letters. You should use `UTF8`, however. – Laurenz Albe Aug 24 '21 at 15:38
1

You may try some crude language detection via letter unigrams. This is extensible to cases where character ranges cannot be used to neatly disambiguate languages (e.g.: Latin languages).

Here is a working gist of this.

Basically:

create table charfreq (
    c text,
    f float,
    lang text
    );
insert into charfreq values
 ('а',  8.04,  'ru'),
 ('б',  1.55,  'ru'),
 ('в',  4.75,  'ru'),
...
 ('a',  8.167, 'en'),
 ('b',  1.492, 'en'),
 ('c',  2.782, 'en'),
...
 ('ï',  0.005, 'fr'),
 ('ô',  0.023, 'fr'),
 ('ù',  0.058, 'fr'),
 ('û',  0.06 , 'fr');

Then, assuming, e.g.

...
insert into test values
 (1, 'hi'),
 (2, 'ok'),
 (3, 'துய'),
 (4, 'нет'),
 (5, 'été');  -- a French message, just for fun

You can use:

select id, message, lang, score
from (
  select *, row_number() OVER (partition by id, message order by score desc) as rownum
  from (
    select id, message, lang, coalesce(sum(f), 0)/length(message) as score
    from (
      select *, unnest(STRING_TO_ARRAY(message, NULL)) as c from test
    ) as a
    left join charfreq b
    using (c)
    group by 1,2,3
  ) as a
) as a
where rownum = 1;

Outcome on the test table above:

id message lang score
1 hi en 6.53
2 ok en 4.1395
3 துய null 0
4 нет ru 6.92
5 été fr 3.41733333333333

Notes

The above is quite crude. A better approach is to use a proper unigram language model. In this other gist, we use such a model (using sum(log(P)) instead of prod(P)), and a default probability (chars not matching) arbitrarily set to 0.01%. In that gist, notice how 'les aventures' is correctly identified as French, whereas 'the adventures' is (also correctly) decided to be English, even though the same [a-z] range is used.

References

  • Source for English and French letters unigram frequencies.
  • Source for Russian letter unigram frequencies.
Pierre D
  • 24,012
  • 7
  • 60
  • 96
0

Table

CREATE TABLE MESSAGE (
   Id INTEGER PRIMARY KEY,
   MESSAGE VARCHAR (50)
);

Values

INSERT INTO MESSAGE VALUES (1, 'hi');//False
INSERT INTO MESSAGE VALUES (2, 'ok');//False
INSERT INTO MESSAGE VALUES (3, 'துய');//False
INSERT INTO MESSAGE VALUES (4, 'нет');//True
INSERT INTO MESSAGE VALUES (5, 'нет-_*/?/()=.,123  ');//True
INSERT INTO MESSAGE VALUES (6, 'нет  123');//True
INSERT INTO MESSAGE VALUES (6, 'нет  123AAAA');//False

Query

SELECT * FROM message m
WHERE 
ARRAY(SELECT ASCII(unnest(STRING_TO_ARRAY(REGEXP_REPLACE(M.message, '[^[:alnum:]]+', '', 'g'), NULL)))) <@ 
ARRAY(SELECT ASCII(unnest(STRING_TO_ARRAY('АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя0123456789', NULL))))
OMANSAK
  • 1,066
  • 1
  • 12
  • 30