
I'm opening a text file in UTF-16 encoding:

with open("file.txt", 'r', encoding="UTF-16") as infile:

Then I want to write to an excel file:

from csv import writer
excelFile = open("excelFile_1.csv", 'w', newline='') 
write = writer(excelFile, delimiter=',')
write.writerows([[input]])

where input is a term from the text file file.txt

I get the following error

UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 113: character maps to <undefined>

Using Python 3.2

Presen
1 Answer


You need to pick an output encoding for the CSV file as well:

excelFile = open("excelFile_1.csv", 'w', newline='', encoding='UTF16') 

The default codec for your system cannot handle the codepoints you are reading from the input file.
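Putting the pieces together, a minimal sketch of the whole round trip might look like the following (the sample input file and its contents are assumptions for illustration; any non-ASCII text would do):

```python
from csv import writer

# Create a sample UTF-16 input file standing in for the OP's file.txt.
with open("file.txt", "w", encoding="UTF-16") as f:
    f.write("café\nrésumé\n")

# Read UTF-16 text and write each line as a row of a UTF-16-encoded CSV.
# Passing an explicit encoding to the output file avoids the charmap error.
with open("file.txt", "r", encoding="UTF-16") as infile, \
     open("excelFile_1.csv", "w", newline="", encoding="UTF-16") as excelFile:
    write = writer(excelFile, delimiter=",")
    for line in infile:
        write.writerow([line.rstrip("\n")])
```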

Opening this file in Excel may not work; do follow the procedure in this answer, picking the UTF16 codec, to ensure that Excel reads the file correctly.

You could also try using UTF-8, adding in a UTF-8 BOM to the start of the file:

excelFile = open("excelFile_1.csv", 'w', newline='', encoding='UTF8')
excelFile.write('\ufeff')  # Zero-width non-breaking space, the Byte Order Mark

It is mostly Microsoft software that uses a BOM in UTF-8 files, since UTF-8 only has one byte order to pick from, unlike UTF-16 and UTF-32, but it apparently makes Excel happy(er).
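As a variant of the manual-BOM approach, Python ships a `utf-8-sig` codec that writes the BOM for you on the first write, so the explicit `excelFile.write('\ufeff')` line can be dropped. A small sketch (the sample row is an assumption):

```python
from csv import writer

# The utf-8-sig codec emits the UTF-8 BOM (EF BB BF) automatically,
# which is the marker Excel looks for to detect UTF-8.
with open("excelFile_1.csv", "w", newline="", encoding="utf-8-sig") as excelFile:
    write = writer(excelFile, delimiter=",")
    write.writerow(["café"])
```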

Martijn Pieters
  • I tried the second option, work great with the regular open of excel, and I didn't need to add the "\ufeff". – Presen Aug 14 '13 at 22:25
  • @user1869297 it will work without the BOM until you have some actual Unicode non-ASCII characters in the file. And I know you know this Martijn, but the purpose of the BOM in this case is not to signify byte order, it's to mark the file as UTF-8 encoded instead of one of the ancient code page encodings that Microsoft still prefers. – Mark Ransom Aug 14 '13 at 22:29
  • @MarkRansom: Yes, I know, Microsoft has to support too many legacy codecs. Note that the OP *does* have codepoints in the Latin-1 range in the output, that's why they had errors in the first place. – Martijn Pieters Aug 14 '13 at 22:32