I am not sure I've got my nomenclature right, so please correct me :)

I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.

Loading this text file into any of my editors (vim, Notepad, TextEdit, ...) shows quite scrambled text, for example:

mhiti

A closer look at the actual bytes then reveals the following (using hexdump -C):

0a 0a 1e 6d 68 69 74 69 0a 0a  ...mhiti..

which seems to me to contain the Unicode code point U+1E6D ("ṭ", LATIN SMALL LETTER T WITH DOT BELOW) written out as the raw bytes 1e 6d. That particular letter has the UTF-8 encoding e1 b9 ad.

My question: is there a tool that can help me convert this particular file into actual UTF-8 encoding? I tried iconv, but without success; I looked briefly into writing a Python script, but I would think there's an easier way to get this done. It seems that this is a useful link for the problem, but isn't there a tool that can get this done? Am I missing something?

EDIT: Just to make things a little more entertaining, there seem to be actual UTF-8-encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes:

01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61
ā     k  i  ñ     c  a  ñ     ñ     ā     y  a  t  a  n  a

where the "ā" is encoded by its Unicode code point U-0101, and the "ñ" is encoded by the UTF-8 sequence \xc3b1 which has Unicode code point U-00F1.

EDIT: Here's one where I can't quite figure out what it's supposed to be:

01 1e 37 01 01 76 61 6b 61
?        ā     v  a  k  a

I can only guess, but even that doesn't make sense. The Unicode code point U+011E is a "Ğ" (UTF-8 \xc4\x9e), but that's not a Pāli character AFAIK; then a "7" follows, which doesn't make sense in a word. The Unicode code point U+1E37, on the other hand, is a "ḷ" (UTF-8 \xe1\xb8\xb7), which is a valid Pāli character, but that would leave the first byte \x01 by itself. If I had to guess I would think this is the name "Jīvaka", but that doesn't match the bytes. LATER: According to the author, this is "Āḷāvaka", so assuming the character-encoding heuristics from above, a \x00 is again missing. Adding it back in:

01 00 1e 37 01 01 76 61 6b 61
Ā     ḷ     ā     v  a  k  a

Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?

Jens

  • Yeah, that is clearly not UTF-8. If that is indeed "ṭhiti", then it does not look like any sane Unicode encoding to me. – R. Martinho Fernandes Apr 02 '13 at 12:51
  • You could try interpreting every 2-byte sequence that starts with a byte > 127 as a Unicode codepoint. But that's a sketchy encoding scheme at best. If you could show us a bit more of the hexdump (together with the expected text), we might find a pattern there. – Joachim Sauer Apr 02 '13 at 12:59
  • @JoachimSauer and it would not even work for the example given... – R. Martinho Fernandes Apr 02 '13 at 13:13
  • @R.MartinhoFernandes: D'oh! Right ... – Joachim Sauer Apr 02 '13 at 13:49
  • This looks like UTF-16 with all the zero bytes removed. – Joni Apr 02 '13 at 18:18
  • @R.MartinhoFernandes: Thanks for validating my suspicion :) – Jens Apr 02 '13 at 21:52
  • @JoachimSauer: That's what I thought at first but I think I'll have to be more restrictive. For example, \x0101 is a "ā" character but still > \x007f. I'll probably constrain to [a-zA-Z] and interpret everything outside of that as a Unicode code point. – Jens Apr 02 '13 at 21:54
  • @Joni: It does. See my answer to Joachim. An alternative approach would be to insert all the \x00 bytes where I suspect them to be missing, i.e. before every [a-zA-Z] ... – Jens Apr 02 '13 at 21:55
  • This continues to look like a "corruption" more than an actual encoding. – Joachim Sauer Apr 04 '13 at 15:03

2 Answers

I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.

From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).

There's no invocation of iconv that'll decode it for you; you'll need to use the heuristics you know, based on character ranges or ordering in the file, to write a custom decoder, or ask for another copy in a standard encoding.
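
As a rough illustration of such a custom decoder, here is a sketch in Python following the heuristic discussed in the comments (re-insert a zero byte before every newline and ASCII letter, treat everything else as a raw big-endian code unit). It is a first step only; the stray UTF-8 runs like c3 b1 noted in the question will still come out wrong and need inspection afterwards:

def rough_decode(data: bytes) -> str:
    # Bytes whose leading zero byte was presumably stripped: newline + letters.
    letters = set(b"\nabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] in letters:              # restore the stripped high byte
            out += b"\x00" + data[i:i + 1]
            i += 1
        else:                               # assume a raw two-byte UTF-16BE code unit
            out += data[i:i + 2]
            i += 2
    return out.decode("utf-16-be", errors="replace")

# The "ṭhiti" example from the question: 0a 0a 1e 6d 68 69 74 69 0a 0a
print(repr(rough_decode(bytes.fromhex("0a0a1e6d686974690a0a"))))   # '\n\nṭhiti\n\n'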

Joe
  • Thanks Joe for the confirmation as well :) Yes, this is supposed to be the Pāli word "ṭhiti". See my answers above as to what I think I'll do about this... – Jens Apr 02 '13 at 21:57
  • Yes, inserting `\x00` before every `[\na-zA-Z]` and outputting the next two bytes as-is otherwise would be a great first step. That can be piped through `iconv -f utf-16be -t utf-8` and inspected for exceptions. – Joe Apr 03 '13 at 10:41

I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16-encoded file; the browser's "Save as" menu then saved that broken rendering, which is what created the initial question for this thread.

It seems that the web browser tried to display the UTF-16-encoded file as text, dropping non-printable bytes like \x00 and converting some other characters to UTF-8, thus completely mangling the original file.

Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.
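
For reference, once the file was fetched intact, the conversion itself is trivial; in Python it amounts to something like this (the file names are placeholders, and encoding="utf-16" assumes the file carries a BOM; use "utf-16-be" for a BOM-less big-endian file):

# Convert the intact UTF-16 dictionary file to UTF-8.
with open("pali-dict.txt", encoding="utf-16") as src, \
     open("pali-dict-utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())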

Jens