I have this legacy code snippet, which (apparently) decodes double-encoded UTF-8 text back to normal UTF-8:

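# Run with python3!
import codecs
import sys

# First pass: read the file and decode its bytes as UTF-8.
s = codecs.open('doubleutf8.dat', 'r', 'utf-8').read()
# Second pass: turn code points up to U+00FF back into single bytes,
# then decode those bytes as UTF-8 again.
sys.stdout.write(s.encode('raw_unicode_escape').decode('utf-8'))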

I need to translate it to Lua and imitate all possible decoding side effects (if any).

Limitations: I may use any available Lua module for UTF-8 handling, but preferably a stable one with LuaRocks support. I will not use Lupa or any other Lua-Python bridge, nor will I call os.execute() to invoke Python.

Alexander Gladysh
1 Answer

You can use lua-iconv, the Lua binding to the iconv library. With it you can convert between character encodings as much as you like.

It is also available in LuaRocks.

Edit: with the help of this answer, I was able to decode the data correctly with the following Lua code:

local iconv = require 'iconv'
-- convert from utf8 to latin1
local decoder = iconv.new('latin1', 'utf8')
-- read the whole file in binary mode so the bytes are not altered
local file = assert(io.open('doubleutf8.dat', 'rb'))
local data = file:read('*a')
file:close()
-- the resulting latin1 bytes are the original text, encoded in utf8
local decodedData = decoder:iconv(data)
-- if your terminal understands utf8, prints "нижний новгород"
-- if not, you can further convert it from utf8 to any encoding, like KOI8-R
print(decodedData)
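
The Lua code above converts to Latin-1, while the Python snippet in the question encodes with `raw_unicode_escape`. The following is a minimal sketch, assuming CPython 3 and its built-in codecs, that checks where the two agree. The variable names are illustrative only. For code points up to U+00FF both codecs should yield the same single byte, which is why the Latin-1 conversion works for data that was double-encoded through Latin-1. Above U+00FF they diverge, which may matter if the side effects of the original snippet have to be reproduced exactly.

```python
# Compare the two codecs on every code point that fits in a single byte.
same_below_256 = all(
    chr(cp).encode("raw_unicode_escape") == chr(cp).encode("latin-1")
    for cp in range(0x100)
)

# Above U+00FF the codecs diverge: raw_unicode_escape emits an ASCII
# \uXXXX escape, while latin-1 refuses to encode the character.
escaped = "\u043d".encode("raw_unicode_escape")  # U+043D is Cyrillic "н"
try:
    "\u043d".encode("latin-1")
    latin1_accepts = True
except UnicodeEncodeError:
    latin1_accepts = False

print(same_below_256, escaped, latin1_accepts)
```

In other words, input that is not purely double-encoded would pass through the Python snippet with `\uXXXX` escapes in the output, whereas a strict Latin-1 conversion would be expected to report an error instead.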
Michal Kottman
  • Um, thanks, but the point of the question is that I'm a bit confused by Python's UTF conversion stuff (what is `raw_unicode_escape` for, for example?), and would like to see an actual piece of Lua code. Sorry for being lazy here. – Alexander Gladysh Feb 17 '11 at 19:47
  • A sample file would help, as I don't know what data to expect. I will try to make an example with lua-iconv. Also, `raw_unicode_escape` means: 'Produce a string that is suitable as raw Unicode literal in Python source code'. – Michal Kottman Feb 17 '11 at 20:08
  • The bogus data (as encoded Lua string literal, join the strings): "\034\195\144\194\189\195\144\194\184\195\144\194\182\195".. "\144\194\189\195\144\194\184\195\144\194\185\032\195\144".. "\194\189\195\144\194\190\195\144\194\178\195\144\194\179".. "\195\144\194\190\195\145\194\128\195\144\194\190\195\144".. "\194\180\034" – Alexander Gladysh Feb 17 '11 at 20:34
  • I hope that I did not mess up the encoding :) – Alexander Gladysh Feb 17 '11 at 20:35
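
For anyone who wants to check the sample data from the comments, here is a minimal Python sketch that applies the same two steps as the snippet in the question to those bytes in memory. The byte values are copied from the decimal escapes in the Lua string literal above. `undo_double_utf8` is a hypothetical helper name, not part of any library.

```python
# Bytes copied from the Lua string literal in the comments (decimal escapes).
raw = bytes([
    34, 195, 144, 194, 189, 195, 144, 194, 184, 195, 144, 194, 182, 195,
    144, 194, 189, 195, 144, 194, 184, 195, 144, 194, 185, 32, 195, 144,
    194, 189, 195, 144, 194, 190, 195, 144, 194, 178, 195, 144, 194, 179,
    195, 144, 194, 190, 195, 145, 194, 128, 195, 144, 194, 190, 195, 144,
    194, 180, 34,
])


def undo_double_utf8(data: bytes) -> str:
    """Hypothetical helper: the question's two decoding steps on bytes."""
    once = data.decode("utf-8")  # first pass: bytes -> code points <= U+00FF
    return once.encode("raw_unicode_escape").decode("utf-8")  # second pass


decoded = undo_double_utf8(raw)

# For this sample the Latin-1 route used in the Lua answer gives the same text.
via_latin1 = raw.decode("utf-8").encode("latin-1").decode("utf-8")
```

On a UTF-8 terminal, `print(decoded)` should show the quoted text "нижний новгород", which matches what the Lua answer reports.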