
I'm looking for an efficient way to take a JavaScript string and return all of the scripts which occur in that string.

Full UTF-16, including the "astral" planes (non-BMP characters, which require surrogate pairs), must be handled correctly. This is possibly the main problem, since JavaScript strings are sequences of 16-bit code units and most built-in string methods are not surrogate-aware.

It only has to deal with codepoints so no fancy awareness of complex scripts or grapheme clusters is necessary. (This will be obvious to some of you anyway.)
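To make the surrogate-pair requirement concrete, here is a minimal sketch of decoding a string into code points by hand. The helper name `toCodePoints` is made up for illustration; it only assumes `charCodeAt`, so it does not depend on any library.

```javascript
// Hypothetical helper: decode a JavaScript string into an array of code points.
function toCodePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one astral code point.
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        out.push(((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        i++; // skip the low surrogate we just consumed
        continue;
      }
    }
    // BMP character, or a lone surrogate passed through unchanged.
    out.push(hi);
  }
  return out;
}
```

For example, `toCodePoints("a\uD835\uDC00")` yields two code points, `0x61` and `0x1D400`, even though the string's `length` is 3.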

Example:

stringToIso15924("παν語");

would return something like:

[ "Grek", "Hani" ]

I'm using node.js and some Unicode libraries such as XRegExp and unorm already so I don't mind adding other libraries that might already handle or ease such a feature.

I'm not aware of a JavaScript library that can look up character properties such as script codes, so this is probably the second part of the problem.

The third part of the problem is just to avoid inefficiencies.

hippietrail
  • Is there any source (i.e. table) you can reference that already maps (ranges, hopefully) of UTF-16 characters (by their code) to the script codes? – Paul S. May 09 '13 at 01:51
  • I think I found the beginning of the story of how the Script property of a Unicode character relates to ISO 15924. http://unicode.org/reports/tr24/#Relation_To_ISO15924 – minopret May 09 '13 at 01:52
  • @PaulS. I don't know if there's some source already prepared for JavaScript but there is the raw [UnicodeData.txt](http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) on the Unicode site which I've processed for such things in the past in Python and Perl. – hippietrail May 09 '13 at 01:59
  • If you want to make your own mapping function, this may be more helpful than listing every letter individually: http://www.unicode.org/Public/UNIDATA/Scripts.txt . The task now is to calculate your character's code point from its UTF-16 code units, then loop until you find which group it falls in. – Paul S. May 09 '13 at 02:02
  • [**Getting pairs from utf-16 code** - you'll want the reverse](http://stackoverflow.com/questions/7126384/expressing-utf-16-unicode-characters-in-javascript). The laborious bit will be making an _Array_ `[{start: 0x0000, end: 0x0040, script: 'Common'}, {start: 0x0041, end: 0x005A, script: 'Latin'}, ...]`, so you can find your script. If there are very many broken groups, it may be worth having an array with an index for each character, but this will take up a lot of memory (traded for CPU). For the results, just add script names as keys to an empty object, then call `Object.keys` on it. – Paul S. May 09 '13 at 02:16
  • @PaulS. Ah yes I believe UnicodeData is the "raw" data and there are a bunch of other "derived" data files. I once implemented something related using binary search. It's probably best to build the table once and include it as raw data in the script, rather than compute it in real time. But maybe it is in some js lib out there already? – hippietrail May 09 '13 at 02:21
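
The range-table idea from the comments above, combined with a binary search, could be sketched roughly like this. The table is a tiny hand-picked excerpt for illustration only (a real one would be generated once from Scripts.txt), the function names are hypothetical, and the sketch assumes an ES2015+ engine so that `for...of` and `codePointAt` take care of surrogate pairs.

```javascript
// Illustrative excerpt only: a real table would be generated from Scripts.txt.
// Ranges must be sorted by `start` and must not overlap.
const SCRIPT_RANGES = [
  { start: 0x0041, end: 0x005A, script: "Latn" },
  { start: 0x0061, end: 0x007A, script: "Latn" },
  { start: 0x03A3, end: 0x03E1, script: "Grek" },
  { start: 0x0400, end: 0x0481, script: "Cyrl" },
  { start: 0x4E00, end: 0x9FA5, script: "Hani" },
  { start: 0x20000, end: 0x2A6D6, script: "Hani" },
];

// Binary search for the range containing `cp`; null if no listed range matches.
function scriptOfCodePoint(cp) {
  let lo = 0;
  let hi = SCRIPT_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const range = SCRIPT_RANGES[mid];
    if (cp < range.start) hi = mid - 1;
    else if (cp > range.end) lo = mid + 1;
    else return range.script;
  }
  return null;
}

// Collect the distinct scripts in order of first appearance.
function stringToIso15924(str) {
  const seen = new Set();
  for (const ch of str) { // for...of iterates by code point, not code unit
    const script = scriptOfCodePoint(ch.codePointAt(0));
    if (script !== null) seen.add(script);
  }
  return Array.from(seen);
}
```

Code points that fall outside the excerpt (spaces, punctuation, any script not listed) are simply skipped here; whether to report them as Common or unknown instead is a design choice left open.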

1 Answer


I answered a similar, or at least related, question. In this pastebin you will find a (looooong) function that returns the script name for a character. It should be easy to modify it to handle a whole string.
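
The pastebin function is not reproduced here, so as a rough sketch of the "accommodate a string" step, the per-character lookup below is a stand-in. It assumes an engine that supports ES2018 Unicode property escapes (`\p{Script=...}` with the `u` flag), covers only four scripts, and uses a hand-written mapping to ISO 15924 codes; all names are hypothetical.

```javascript
// Hypothetical stand-in for the per-character lookup: one regex per script.
// Only four scripts are listed; the code-to-script mapping is hand-written.
const SCRIPT_TESTS = [
  ["Latn", /\p{Script=Latin}/u],
  ["Grek", /\p{Script=Greek}/u],
  ["Cyrl", /\p{Script=Cyrillic}/u],
  ["Hani", /\p{Script=Han}/u],
];

// Return the ISO 15924 code for one character, or null if it is not listed.
function scriptOfChar(ch) {
  for (const [code, re] of SCRIPT_TESTS) {
    if (re.test(ch)) return code;
  }
  return null;
}

// Wrap the per-character lookup so it handles a whole string.
function scriptsInString(str) {
  const found = {};
  for (const ch of str) { // iterates by code point, so surrogate pairs stay intact
    const code = scriptOfChar(ch);
    if (code !== null) found[code] = true;
  }
  return Object.keys(found);
}
```

With this, `scriptsInString("παν語")` gives `[ "Grek", "Hani" ]`. Testing each character against every regex is linear in the number of scripts, so a range table with binary search is probably the better fit if efficiency matters.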

Community
dda