I want to get back into JavaScript Unicode programming.

In fact I want to do everything in JavaScript since I can do it wherever I have a browser.

The most important resource for low-level Unicode is the machine-readable file UnicodeData.txt, which is officially available via FTP:

ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt

But if I'm doing everything in JavaScript I'll need to fetch that file to process it, since I'm not aware of any JavaScript library that provides the data via some wrapper.

But of course JavaScript in the browser can only fetch it with XMLHttpRequest, which doesn't support FTP.

I thought I had located the file via HTTP at unicode.org too, but it didn't support CORS (Cross-origin resource sharing) and I think it was only an ancient Unicode 1.0 version anyway.

So does anyone know of any HTTP URL where I can fetch an up-to-date UnicodeData.txt via JavaScript?

Maybe Google or ICU or Yahoo hosts some machine-readable files? Or maybe somebody even made a JSON version of it so I can use JSONP to fetch it instead of needing CORS?


Why do I want to do this? I want to implement the kind of functionality provided by Python's unicodedata module and Perl's Unicode::UCD module. I've done it before but don't have access to my old code. Also, my old code used Perl or Python to generate some of the JavaScript code and tables. Now, as a learning exercise, I want to do the code and table generation all in JavaScript.
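For illustration, here is a minimal sketch of the table-generation step once the text of UnicodeData.txt is in hand. It relies on the file's layout of 15 semicolon-separated fields per line. The function names (`parseUnicodeDataLine`, `parseUnicodeData`) and the property names in `FIELDS` are hypothetical labels chosen for this example, not part of any library, and the sketch does not expand the `<..., First>` / `<..., Last>` range entries.

```javascript
// Hypothetical property names for the 15 columns of UnicodeData.txt.
const FIELDS = [
  "code", "name", "generalCategory", "combiningClass", "bidiClass",
  "decomposition", "decimalValue", "digitValue", "numericValue",
  "bidiMirrored", "unicode1Name", "isoComment",
  "simpleUppercase", "simpleLowercase", "simpleTitlecase"
];

// Parse one line into a record object, or return null if the line
// does not have exactly 15 fields (e.g. an empty trailing line).
function parseUnicodeDataLine(line) {
  const parts = line.split(";");
  if (parts.length !== FIELDS.length) return null;
  const record = {};
  FIELDS.forEach((field, i) => { record[field] = parts[i]; });
  // Keep the numeric code point alongside the original hex string.
  record.codePoint = parseInt(record.code, 16);
  return record;
}

// Parse the whole file text into a Map keyed by numeric code point.
// Range markers such as "<CJK Ideograph, First>" are stored as
// ordinary records here; expanding them is left out of this sketch.
function parseUnicodeData(text) {
  const table = new Map();
  for (const line of text.split(/\r?\n/)) {
    const record = parseUnicodeDataLine(line);
    if (record) table.set(record.codePoint, record);
  }
  return table;
}

// Two lines in the UnicodeData.txt format, used as sample input.
const sample = [
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
  "0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;"
].join("\n");

const table = parseUnicodeData(sample);
console.log(table.get(0x0301).generalCategory);
```

All field values are kept as strings, so the combining class of U+0301 comes back as `"230"` and would need `parseInt` before numeric comparison.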

hippietrail
  • Why do you need this? Javascript has unicode bmp support built-in – Esailija Jun 07 '12 at 06:28
  • @Derek: Sorry I hit enter while editing the tags and it submitted the half-finished question. Fixed now. – hippietrail Jun 07 '12 at 06:30
  • @Esailija: BMP support is only one tiny facet of Unicode. Right now I want to play with grapheme clusters, combining characters, character classes etc. For instance, Python has a `unicodedata` module which wraps this info - [check out all the stuff it can do](http://docs.python.org/library/unicodedata.html). – hippietrail Jun 07 '12 at 06:31
  • You know what, you can actually do this in JavaScript: `"\u0041"` gives you `"A"`, `"\u0042"` gives you `"B"`. And then you can print out the whole table. – Derek 朕會功夫 Jun 07 '12 at 06:34
  • @hippietrail well it's not tiny, in fact, 75% of that file is BMP characters and the rest of the characters are not that used, and even then you can create them in javascript by using custom String.fromCharCode that builds surrogate pairs. – Esailija Jun 07 '12 at 06:34
  • @Esailija: You're only thinking about codepoints, there is a lot more to Unicode than the list of codepoints. Also I want to make code that works with [characters which are "not that used"](http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use). – hippietrail Jun 07 '12 at 06:37
  • There's actually also an HTTP server running on `ftp.unicode.org` port 80, so http://ftp.unicode.org:80/Public/UNIDATA/UnicodeData.txt works. Of course, they might decide to change that at some point... – Ulrich Schwarz Jun 07 '12 at 06:42
  • So 30 or so characters are used in special math context outside BMP? If you are building some kind of unicode helper application, then I understand the need for this. But not for generic application. @UlrichSchwarz true but it doesn't send CORS headers. – Esailija Jun 07 '12 at 06:46
  • @Esailija: I am building some kind of Unicode helper application. – hippietrail Jun 07 '12 at 06:49
  • ok then :) Can't you just use [YQL](http://developer.yahoo.com/yql/) then? – Esailija Jun 07 '12 at 06:54
  • Actually I'm thinking of YQL as a fallback - it's a great service! – hippietrail Jun 07 '12 at 06:57
  • @UlrichSchwarz: the correct HTTP url is [http://www.unicode.org/Public/UNIDATA/UnicodeData.txt](http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) – Remy Lebeau Jun 08 '12 at 04:09
  • 10 years later and Unicode v14, still no JSON, no CORS. – NVRM Apr 12 '22 at 13:54
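
The surrogate-pair approach mentioned in the comments above can be sketched roughly as follows. The name `codePointToString` is a hypothetical helper invented for this example; engines that support ES2015 already provide `String.fromCodePoint`, which does the same job.

```javascript
// Hypothetical helper: build a string from one Unicode code point,
// emitting a UTF-16 surrogate pair for code points above U+FFFF.
function codePointToString(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("Invalid code point: " + cp);
  }
  // BMP code points fit in a single UTF-16 code unit.
  if (cp <= 0xFFFF) return String.fromCharCode(cp);
  // Supplementary code points: split the 20-bit offset into two halves.
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return String.fromCharCode(high, low);
}

// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
console.log(codePointToString(0x1D11E) === "\uD834\uDD1E");
```

This only covers building strings from code points; properties such as combining classes or grapheme cluster boundaries still need the data tables the question asks about.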

1 Answer

Although it's not "official", and although it provides data at a higher level than the raw UnicodeData.txt file, Mathias Bynens' unicode-data project might be useful to you. It offers an HTTP API, detailed in the README on GitHub, as well as here.

slevithan