91

I'm having problems reading from a file, processing its contents as a string, and saving the result to a UTF-8 file.

Here is the code:

try:
    filehandle = open(filename,"r")
except:
    print("Could not open file " + filename)
    quit() 

text = filehandle.read()
filehandle.close()

I then do some processing on the variable text.

And then

try:
    writer = open(output,"w")
except:
    print("Could not open file " + output)
    quit() 

#data = text.decode("iso 8859-15")    
#writer.write(data.encode("UTF-8"))
writer.write(text)
writer.close()

This writes the output file perfectly, but according to my editor the file is in ISO 8859-15. Since the same editor recognizes the input file (in the variable filename) as UTF-8, I don't know why this happens. As far as my research has shown, the commented lines should solve the problem. However, when I use those lines, the resulting file has gibberish in the special characters, mainly in words with accents (tildes), as the text is in Spanish. I would really appreciate any help, as I am stumped.

Hristo Iliev
aarelovich
  • 2
    Which editor is this? Which python version? From here this code seems to be completely valid and should work as expected … – filmor Oct 25 '13 at 13:43
  • Kate is the editor. The output of python --version is Python 2.7.5+ – aarelovich Oct 25 '13 at 13:49
  • I've tested your code with 2.6.8, 2.7.5+ and 3.3.2+ everything works fine. Could you provide some example input? – zero323 Oct 25 '13 at 13:53
  • Since the text was processed in raw bytes the unseen processing code probably messed up the UTF8 encoding. – Mark Tolonen Oct 25 '13 at 13:59
  • I'd love to provide an example file, however I can't find a way to upload it here... – aarelovich Oct 25 '13 at 14:08
  • @MarkTolonen I have commented all my unseen code however the error remains. It was a good idea though... – aarelovich Oct 25 '13 at 14:12
  • 3
    Ok. I've solved it. It was mostly my fault, so sorry everyone. Here is what happened. The code provided by @MarkTolonen worked once I specified iso-8859-15 instead of utf-8 when opening the input file. However, my editor refreshed the file from memory with the old encoding already loaded, so it showed me the gibberish. When I opened the file again, it displayed fine. Thank you all and sorry for the bother!!! – aarelovich Oct 25 '13 at 14:24
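The resolution described in the comment above (the input was really ISO-8859-15, and the goal was UTF-8 output) can be sketched roughly as follows in Python 3. The helper name `transcode`, the temporary file names, and the sample text are all hypothetical and only serve the demonstration; the source encoding still has to be known or determined separately.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding="iso-8859-15", dst_encoding="utf-8"):
    """Read src_path as src_encoding and write the same text to dst_path as dst_encoding."""
    # newline="" leaves line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text


# Demonstration with a throwaway file (hypothetical names and content).
sample = "a\u00f1o, canci\u00f3n"  # "año, canción" written with escapes
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")

# Create an input file that really is ISO-8859-15 on disk.
with open(src_path, "wb") as f:
    f.write(sample.encode("iso-8859-15"))

transcode(src_path, dst_path)

# Read the result back as raw bytes to inspect the actual encoding.
with open(dst_path, "rb") as f:
    output_bytes = f.read()
```

After converting, reopening the output file in the editor (rather than relying on an already loaded buffer) avoids the stale-encoding confusion mentioned in the comment.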

6 Answers

223

Process text to and from Unicode at the I/O boundaries of your program using open with the encoding parameter. Make sure to use the (hopefully documented) encoding of the file being read. The default encoding varies by OS (specifically, locale.getpreferredencoding(False) is the encoding used), so I recommend always explicitly using the encoding parameter for portability and clarity (Python 3 syntax below):

with open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with open(filename, 'w', encoding='utf8') as f:
    f.write(text)

If still using Python 2 or for Python 2/3 compatibility, the io module implements open with the same semantics as Python 3's open and exists in both versions:

import io
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()

# process Unicode text

with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
Mark Tolonen
  • 7
    I did exactly what you told me. Same error as with the other suggestion – aarelovich Oct 25 '13 at 14:18
  • 1
    I've got it to work. Problem was the original file was iso-8859-15 – aarelovich Oct 25 '13 at 14:24
  • @aarelovich you may need to pass `errors='ignore'` or `errors='replace'` to `open()` ... if you do not know the file's encoding. :) –  Sep 22 '16 at 19:05
  • Doesn't work with the string "présenté alloué ééé ààà tué" – Lior Magen Feb 08 '17 at 14:11
  • @LiorMagen I assume you are the recent down voter. You do have to write the string in a file and specify the encoding used, which may not be UTF8. – Mark Tolonen Feb 08 '17 at 15:27
  • I can't thank you enough! Saved me when converted my colleague's old DOS file in ibm852 to utf8. – Hrvoje T Jun 09 '17 at 06:13
  • @MarkTolonen I guess it would be unnecessarily redundant to write `f.write(text.encode('utf-8'))` given the `encoding='utf8'` parameter in `io.open()`, right? – arturomp May 08 '18 at 21:31
  • 1
    @arturomp It also wouldn't work. `io.open` expects Unicode strings to be written, not byte strings. It does the encoding to the declared encoding. – Mark Tolonen May 08 '18 at 23:36
  • 1
    @arturomp Correction, it won't work on Python 3. Python 2 will implicitly convert the byte string back to Unicode using the default `ascii` codec, so it will work as long as the string is only ASCII. That's why Python 3 changed it...it prevents "it will work sometimes" which is an annoying bug to track down. – Mark Tolonen May 09 '18 at 00:09
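The point made in the last comments can be checked with a small Python 3 sketch: a file opened in text mode with an encoding accepts `str` and does the encoding itself, while handing it already-encoded bytes raises `TypeError`. The temporary path and sample text are assumptions for illustration only.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file
text = "canci\u00f3n"  # "canción" written with an escape

with io.open(path, "w", encoding="utf8") as f:
    f.write(text)  # a str goes in; the file object encodes it as UTF-8
    try:
        f.write(text.encode("utf-8"))  # bytes are not accepted in text mode
        bytes_rejected = False
    except TypeError:
        bytes_rejected = True

# Read the file back as raw bytes to see what was actually stored.
with io.open(path, "rb") as f:
    raw = f.read()
```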
13

You can also work around decoding errors by passing errors="ignore", which silently drops any bytes that cannot be decoded:

with open(completefilepath, 'r', encoding='utf8', errors='ignore') as file:
    text = file.read()
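To see what the `errors` argument actually does, here is a minimal Python 3 sketch that decodes the same bytes three ways. The sample word is an assumption chosen for illustration; it is encoded as ISO-8859-15 and then deliberately decoded as UTF-8.

```python
# "niño" encoded as ISO-8859-15 gives the bytes b'ni\xf1o'.
raw = "ni\u00f1o".encode("iso-8859-15")

# errors="ignore" silently drops the byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD (the replacement character) instead.
replaced = raw.decode("utf-8", errors="replace")

# The default (strict) handler raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Both lenient handlers lose the original character, so they hide a wrong encoding guess rather than fix it; decoding with the file's real encoding keeps the text intact.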
Noel Widmer
Siva Kumar
5

In Python 2 you can't do that using the built-in open; use codecs.

When you open a file in Python 2 with the built-in open function, you read and write raw byte strings; no encoding is applied. To write the file as UTF-8, try this:

import codecs
file = codecs.open('data.txt','w','utf-8')
Fernando Freitas Alves
  • 2
    Tried this and I got an error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 57: invalid continuation byte – aarelovich Oct 25 '13 at 14:07
  • Are you saving with the utf-8 encoding? Look, if you're reading from another file that is ASCII, you have to decode it first. – Fernando Freitas Alves Oct 25 '13 at 14:10
  • The code is as you see it. What I did is replaced the line writer = open(output,'w') with writer = codecs.open(output,'w','utf-8') and that got me that error – aarelovich Oct 25 '13 at 14:13
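The error quoted above is what one would expect when ISO-8859-15 bytes are decoded as UTF-8: byte 0xe9 is "é" in ISO-8859-15, but in UTF-8 it starts a multi-byte sequence and must be followed by continuation bytes. A minimal Python 3 sketch, using a made-up byte string rather than the asker's actual file:

```python
# 'présenté ' as ISO-8859-15 bytes; 0xe9 is the byte for 'é' in that encoding.
raw = b"pr\xe9sent\xe9 "

error_reason = None
error_position = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xe9 announces a three-byte UTF-8 sequence, but the next byte is 's'.
    error_reason = exc.reason
    error_position = exc.start

# Decoding with the encoding the bytes were really written in succeeds.
decoded = raw.decode("iso-8859-15")
```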
1

The encoding parameter is what does the trick.

my_list = ['1', '2', '3', '4']
with open('test.txt', 'w', encoding='utf8') as file:
    for i in my_list:
        file.write(i + '\n')
0

You can try using utf-16; it might work if the file is actually encoded as UTF-16:

import pandas as pd
data = pd.read_table(filename, encoding='utf-16', delimiter="\t")
sauravjoshi23
0

A combination of @Fernando's and @Siva's solutions works best for me:

import codecs
with codecs.open(filename, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()
Tedo Vrbanec