
My code looks like the following:

import codecs
import glob
import os

# dir is set elsewhere to the folder that holds the Latin-1 .txt files
for file in glob.iglob(os.path.join(dir, '*.txt')):
    print(file)
    with codecs.open(file, encoding='latin-1') as f:
        infile = f.read()

# write the text read from the (last) file back out as UTF-8
with codecs.open('test.txt', mode='w', encoding='utf-8') as f:
    f.write(infile)

The files I work with are encoded in Latin-1 (trying to open them as UTF-8 fails, so they are obviously not UTF-8), but I want to write the resulting files out as UTF-8.

But content like this:

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="español">
<Datos clave_texto=" VALE_M11_070" tipo_texto="entrevista_semidirigida">
<Corpus corpus="PRESEEA" subcorpus="ESESUMA" ciudad="Valencia" pais="España"/>

Instead becomes this (in gedit):

<Trans audio_filename="VALE_M11_070.MP3" xml:lang="espa뇃漀氀∀㸀ഀ਀㰀䐀愀琀`漀猀 挀氀愀瘀攀开琀攀砀琀漀㴀∀ 嘀䄀䰀䔀开䴀㄀㄀开 㜀

If I print it in the terminal, it shows up fine.

Even more confusing is what I get when I open the resulting file with LibreOffice Writer:

<#T#r#a#n#s# (and so on)

So how do I properly convert Latin-1 text to UTF-8? In Python 2 this is easy, but in Python 3 it seems confusing to me.
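What I mean by easy in Python 2 is roughly the usual decode/encode round-trip, something like this (a sketch from memory, untested; the file names are just examples):

raw = open('some_file.txt').read()                 # Python 2: a byte string read from the Latin-1 file
converted = raw.decode('latin-1').encode('utf-8')  # to unicode, then to UTF-8 bytes
open('some_file_utf8.txt', 'w').write(converted)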

I have already tried these in different combinations:

#infile = bytes(infile,'utf-8').decode('utf-8')
#infile = infile.encode('utf-8').decode('utf-8')

But somehow I always end up with the same weird output.

Thanks in advance!

Edit: This question is different from the questions linked in the comments, as it concerns Python 3, not Python 2.7.

I.P.
  • First decode the string, then re-encode it to `utf-8`? – Christian Dean Nov 09 '16 at 17:30
  • http://stackoverflow.com/questions/14443760/python-converting-latin1-to-utf8 – Dr Xorile Nov 09 '16 at 17:39
  • What encoding are you using in Texteditor? (And are you actually using a program called Texteditor? Google doesn't turn up anything by that name.) – user2357112 Nov 09 '16 at 17:45
  • Also, why do you think the files were originally in latin-1? It looks to me like they were in some other encoding, maybe UTF-16. – user2357112 Nov 09 '16 at 17:46
  • @user3030010 and Dr Xorile both those questions ask about Python 2.7 and do not provide an answer for Python 3.5. The solution for Python 2.7 is not viable for 3.5, sadly. – I.P. Nov 10 '16 at 09:52
  • @user2357112 You might be right. Chardetect identifies it as Windows-1252, and if I save the file it shows as UTF-16. The error is the same, though (with UTF-16 it's a different error). My solution for now is to manually save the files as UTF-8, which works, but it's a dirty solution and I'd prefer to do it in Python. – I.P. Nov 11 '16 at 10:20
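A minimal sketch of the kind of encoding check mentioned in the last comment, using the chardet library (the file name is only a placeholder):

import chardet

# read the raw bytes and let chardet guess the encoding
with open('VALE_M11_070.txt', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)
print(guess['encoding'], guess['confidence'])  # e.g. 'Windows-1252' plus a confidence score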

2 Answers


I have found a partial way around this. It is not exactly what you want / need, but it might point others in the right direction...

# First read the file (it is encoded in Latin-1)
txt = open("file_name", "r", encoding="latin-1")  # r = read, w = write & a = append
items = txt.readlines()
txt.close()

# and write the changes back to the file as UTF-8
output = open("file_name", "w", encoding="utf-8")
for string_fin in items:
    if "é" in string_fin:
        string_fin = string_fin.replace("é", "é")

    if "ë" in string_fin:
        string_fin = string_fin.replace("ë", "ë")

    # this works if not too much needs changing...

    output.write(string_fin)

output.close()

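The same read-then-rewrite idea can also be applied to every .txt file from the question; here is a minimal sketch, assuming the files sit in one folder and that writing a converted copy next to each original is acceptable:

import codecs
import glob
import os

src_dir = 'corpus'  # placeholder for the folder with the Latin-1 files
for path in glob.iglob(os.path.join(src_dir, '*.txt')):
    with codecs.open(path, encoding='latin-1') as f:
        text = f.read()
    # write a UTF-8 copy next to the original (the naming scheme is just an example)
    with codecs.open(path + '.utf8.txt', mode='w', encoding='utf-8') as out:
        out.write(text)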

Community

For Python 3.6:

your_str = your_str.encode('utf-8').decode('latin-1')
Frenzi
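A minimal sketch of how an encode/decode round-trip like this can repair text that was decoded with the wrong codec. The codec order below assumes the underlying bytes were really UTF-8 but were read as Latin-1, which is the kind of mojibake (e.g. "é" instead of "é") that the first answer fixes by hand; if the mix-up went the other way, the two codecs swap places:

# the mis-read string holds mojibake such as "é" instead of "é"
mojibake = 'Valencia, Espa\u00c3\u00b1a'               # what a mis-decoded "España" looks like
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)                                        # -> Valencia, España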