
I'm cleaning up a database of "profiles" of entities (people, organizations, etc.). One part of each profile is the entity's name in its native script (e.g. Thai), encoded in UTF-8. The previous data structure didn't capture the script of the name, so we now have more records with invalid values than we can review manually.

What I need to do at this point is, via script, determine what language/script any given name is in. With a sample data set of:

Name: "แผ่นดินต้น"
Script: NULL

Name: "አብርሃም"
Script: NULL

I need to end up with

Name: "แผ่นดินต้น"
Script: Thai

Name: "አብርሃም"
Script: Amharic

I do not need to translate the names, just determine what script they're in. Is there an established technique for figuring this sort of thing out?

Andy
  • You can try with https://metacpan.org/pod/Encode::Guess. It might tell you what it is for a lot of them, and then you can actually convert instead of deleting. The ones it can't guess you could delete. Can you add some example data of what you have in the DB? – simbabque Jul 26 '16 at 18:15
  • Lingua::Identify is for languages, not for encodings. I'm sure that didn't work out well. – simbabque Jul 26 '16 at 18:17
  • @simbabque Deleting is 100% out of the question, we'll just have to come up with another way to deal with the stragglers. Unfortunately, I can't share any examples, but the data I'll be dealing with is really no more complex than names in (possibly) other languages than English. – Andy Jul 26 '16 at 18:17
  • @simbabque it served well enough for me to be able to tell the requestor that I wouldn't be able to address the question. :D – Andy Jul 26 '16 at 18:18
  • What is the DB column's encoding? Were those names written in the native script's non-UTF-8 encoding, i.e. Latin-1 for English or German, cp-1251 for Russian/Cyrillic, and so on? I had to convert a CSV file of translations that was saved as Latin-1 but contained languages from all over Europe into individual .po files for i18n in UTF-8, and Encode::Guess was very helpful for that. – simbabque Jul 26 '16 at 18:19
  • @simbabque the NVARCHAR columns are in UTF-8. I'll add that to the question, thanks. – Andy Jul 26 '16 at 18:21
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118346/discussion-between-andy-and-simbabque). – Andy Jul 26 '16 at 18:23
  • Related: http://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character – Robᵩ Jul 26 '16 at 19:04

2 Answers


You can use charnames in Perl to figure out the name of a given character.

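use strict;
use warnings;
use utf8;            # this source file contains a literal non-ASCII character
use feature 'say';
use charnames ();    # load the module; charnames::viacode() needs no imports

# Look up the official Unicode name of a character from its code point.
say charnames::viacode(ord 'Բ');

__END__
ARMENIAN CAPITAL LETTER BEN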

With that, you can break all your strings apart into characters and then build a counting hash for each type of character group. Figuring out the groups from the names is a bit tricky, but it's a start. Once you're done with a string, the group with the highest count wins. That way, punctuation and numbers won't get in the way.
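The counting idea can be sketched in Python 3 with only the standard library. This is a rough illustration, not a tested solution for the asker's data: `dominant_group` is a hypothetical helper name, and treating the first word of each character's Unicode name as its "group" is an assumption that works for names such as `THAI CHARACTER ...` or `ETHIOPIC SYLLABLE ...` but is not a real script lookup.

```python
import collections
import unicodedata


def dominant_group(name):
    """Return the most common first word among the Unicode names of the
    letters and marks in `name`, or None if there are none.

    Assumption: the first word of a character's name (e.g. 'THAI') is a
    good enough stand-in for its script. This is a heuristic only.
    """
    counts = collections.Counter()
    for char in name:
        # Only letters (L*) and combining marks (M*) get a vote, so digits,
        # punctuation and spaces cannot influence the result.
        if unicodedata.category(char)[0] not in ("L", "M"):
            continue
        char_name = unicodedata.name(char, "")  # "" for unnamed code points
        if char_name:
            counts[char_name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(dominant_group("แผ่นดินต้น"))
print(dominant_group("አብርሃም"))
```

The group labels come out in upper case (`THAI`, `ETHIOPIC`, `LATIN`), so they would still need mapping to whatever values the `Script` column expects.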

It's probably smarter to find something that already knows the names of the Unicode ranges and makes them easy to look up. I know there is at least one module on CPAN that does that, but I cannot find it right now. Something like that could be repurposed to make the lookup easier.
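To illustrate what a range-based lookup might look like, here is a minimal Python 3 sketch. It is not the CPAN module mentioned above: the `BLOCKS` table is a small hand-copied subset of Unicode block ranges, and `block_of` and `dominant_block` are hypothetical helper names. Note that Unicode blocks are only an approximation of scripts, and a real solution would load the complete table from the Unicode Character Database rather than hard-code a few rows.

```python
import bisect
import collections

# Hand-copied subset of Unicode blocks, sorted by first code point:
# (first code point, last code point, label). Incomplete by design.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x1200, 0x137F, "Ethiopic"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(char):
    """Return the label of the block containing `char`, or None if the
    character falls outside the (incomplete) table."""
    code_point = ord(char)
    # Index of the last block whose start is <= code_point.
    index = bisect.bisect_right(STARTS, code_point) - 1
    if index >= 0:
        _, end, label = BLOCKS[index]
        if code_point <= end:
            return label
    return None


def dominant_block(name):
    """Return the most common known block among the alphabetic characters
    in `name`, or None if none of them are in the table."""
    counts = collections.Counter(
        block_of(char) for char in name if char.isalpha()
    )
    counts.pop(None, None)  # ignore characters the table does not cover
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(dominant_block("แผ่นดินต้น"))
print(dominant_block("አብርሃም"))
```

Binary search over the sorted start points keeps each lookup cheap even if the table grows to the full list of blocks.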

simbabque

Using the unicodedata2 Python module described here and here, you can examine the Unicode script for each character, like so:

#!/usr/bin/env python2
#coding: utf-8

import unicodedata2
import collections

def scripts(name):
    # Count the Unicode script of every character, most frequent first.
    counts = collections.Counter(unicodedata2.script(char) for char in name)
    return ', '.join(script for script, _ in counts.most_common())


assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'
Robᵩ