I have a table with messages, something like this
id message
1 hi
2 ok
3 துய
4 нет
In result I want to get only the fourth row, since the first 3 ones are written in different language
I have a table with messages, something like this
id message
1 hi
2 ok
3 துய
4 нет
In result I want to get only the fourth row, since the first 3 ones are written in different language
You could check for certain ranges of characters.
For example, this query checks if the string contains a Cyrillic letter:
SELECT id FROM mytable
WHERE message ~ '[АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя]';
You may try some crude language detection via letter unigrams. This is extensible to cases where character ranges cannot be used to neatly disambiguate languages (e.g.: Latin languages).
Here is a working gist of this.
Basically:
create table charfreq (
c text,
f float,
lang text
);
insert into charfreq values
('а', 8.04, 'ru'),
('б', 1.55, 'ru'),
('в', 4.75, 'ru'),
...
('a', 8.167, 'en'),
('b', 1.492, 'en'),
('c', 2.782, 'en'),
...
('ï', 0.005, 'fr'),
('ô', 0.023, 'fr'),
('ù', 0.058, 'fr'),
('û', 0.06 , 'fr');
Then, assuming, e.g.
...
insert into test values
(1, 'hi'),
(2, 'ok'),
(3, 'துய'),
(4, 'нет'),
(5, 'été'); -- a French message, just for fun
You can use:
select id, message, lang, score
from (
select *, row_number() OVER (partition by id, message order by score desc) as rownum
from (
select id, message, lang, coalesce(sum(f), 0)/length(message) as score
from (
select *, unnest(STRING_TO_ARRAY(message, NULL)) as c from test
) as a
left join charfreq b
using (c)
group by 1,2,3
) as a
) as a
where rownum = 1;
Outcome on the test table above:
id | message | lang | score |
---|---|---|---|
1 | hi | en | 6.53 |
2 | ok | en | 4.1395 |
3 | துய | null | 0 |
4 | нет | ru | 6.92 |
5 | été | fr | 3.41733333333333 |
Notes
The above is quite crude. A better approach is to use a proper unigram language model. In this other gist, we use such a model (using sum(log(P))
instead of prod(P)
), and a default probability (chars not matching) arbitrarily set to 0.01%. In that gist, notice how 'les aventures'
is correctly identified as French, whereas 'the adventures'
is (also correctly) decided to be English, even though the same [a-z]
range is used.
References
Table
CREATE TABLE MESSAGE (
Id INTEGER PRIMARY KEY,
MESSAGE VARCHAR (50)
);
Values
INSERT INTO MESSAGE VALUES (1, 'hi');//False
INSERT INTO MESSAGE VALUES (2, 'ok');//False
INSERT INTO MESSAGE VALUES (3, 'துய');//False
INSERT INTO MESSAGE VALUES (4, 'нет');//True
INSERT INTO MESSAGE VALUES (5, 'нет-_*/?/()=.,123 ');//True
INSERT INTO MESSAGE VALUES (6, 'нет 123');//True
INSERT INTO MESSAGE VALUES (6, 'нет 123AAAA');//False
Query
SELECT * FROM message m
WHERE
ARRAY(SELECT ASCII(unnest(STRING_TO_ARRAY(REGEXP_REPLACE(M.message, '[^[:alnum:]]+', '', 'g'), NULL)))) <@
ARRAY(SELECT ASCII(unnest(STRING_TO_ARRAY('АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя0123456789', NULL))))