Corrupted Hebrew: saved as ANSI - convert back to UTF-8

Question

I suspect some data was saved (on Windows machines) as ANSI. As a result, the original Hebrew characters were lost, and what we see instead is text like ùéôåãé äòéø.

Is the information lost, or is it possible to map the characters back, knowing that the original text was Hebrew?

score 1 · Answer 1 · edited May 23 '17 at 11:51

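The information is probably not lost, or at most partially lost: the bytes are most likely still valid windows-1255 (the Hebrew ANSI code page) and are only being displayed as if they were windows-1252. If you want to use Python: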

import codecs

BLOCKSIZE = 1048576  # or some other desired size, in bytes

# Read the file as windows-1255 and write it back out as UTF-8.
with codecs.open("input.txt", "r", "windows-1255") as sourceFile:
    with codecs.open("output.txt", "w", "utf-8") as targetFile:
        while True:
            # Work in chunks so large files do not have to fit in memory.
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

Stolen and adapted from How to convert a file to utf-8 in Python?
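Before converting, it can help to confirm the suspicion about the encoding. The following is a minimal sketch, not part of the linked answer: `decode_guess` is a hypothetical helper that tries strict UTF-8 first and falls back to windows-1255 only if that fails. It assumes UTF-8 and windows-1255 are the only two candidates.

```python
def decode_guess(raw):
    """Return (encoding_name, text) for the given bytes.

    Assumption: the data is either valid UTF-8 or windows-1255.
    """
    try:
        # Strict decoding: any invalid byte sequence raises UnicodeDecodeError.
        return "utf-8", raw.decode("utf-8")
    except UnicodeDecodeError:
        # Raises UnicodeDecodeError again if a byte is undefined in windows-1255.
        return "windows-1255", raw.decode("windows-1255")


if __name__ == "__main__":
    # The bytes behind the sample from the question.
    sample = b"\xf9\xe9\xf4\xe5\xe3\xe9 \xe4\xf2\xe9\xf8"
    encoding, text = decode_guess(sample)
    print(encoding, text)
```

The fallback is reasonably safe for plain Hebrew text because a windows-1255 Hebrew letter (bytes 0xE0 to 0xFA) followed by another letter, a space, or ASCII punctuation is not a valid UTF-8 sequence.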

You can also use an external tool, like iconv:

iconv -f windows-1255 -t utf-8 input.txt > output.txt

iconv is available in most Linux distributions, in Cygwin, and on other platforms.

If the file got double-mangled, you may need to do something like this:

iconv -f utf-8 -t windows-1252 input.txt > tmp.txt
iconv -f windows-1255 -t utf-8 tmp.txt > output.txt

but the chances that this kind of stuff happened are minuscule.
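The same two-step repair can be sketched in Python for text that is already in memory. This is an illustration rather than the answer's own code: `fix_double_mangled` is a hypothetical name, and it assumes the text started as windows-1255 bytes that were misread as windows-1252, matching the iconv commands above.

```python
def fix_double_mangled(text):
    """Undo a windows-1255 -> windows-1252 misread on an already-decoded string."""
    # Step 1: recover the original bytes by reversing the wrong windows-1252 decode.
    raw = text.encode("windows-1252")
    # Step 2: decode those bytes with the code page they were actually written in.
    return raw.decode("windows-1255")


if __name__ == "__main__":
    print(fix_double_mangled("ùéôåãé äòéø"))
```

For the sample in the question, this yields שיפודי העיר. If the mangled text contains characters that do not exist in windows-1252, the `encode` step raises `UnicodeEncodeError`, which is a sign that this assumption does not hold for the data.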

It might be a dedicated Hebrew code page rather than UTF-8. – Mark Ransom Apr 18 '15 at 13:25
