I'm using the following code to write a UTF-8 string to a file:

output.write(currentWord.m_wordHeb)

I also tried:

output.write(currentWord.m_wordHeb.encode('utf-8'))

and I also added:

import sys
import codecs

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

I keep getting errors, usually this one:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>

Thank you!

SagiLow
  • What is the type and value of `currentWord.m_wordHeb`? Please show the output of `print(type(currentWord.m_wordHeb))` and `print(currentWord.m_wordHeb)` – Mark Byers Dec 22 '12 at 20:06
  • I get the following error for both print lines: 'TypeError: must be str, not bytes'. I build the string using several methods, each of which returns something like u'א' (a Hebrew character), and I concatenate the results with 'string += method()'. So I can't understand why it is bytes and not str. – SagiLow Dec 22 '12 at 20:25
  • I highly doubt that `print(type(...bla...))` gives you the error message `TypeError: must be str, not bytes` unless you have done something incredibly nasty such as changing the definition of `print`. But if you can show a complete, self-contained piece of code that is runnable and gives this error, I'd be interested to see it. Because when I run your code (after guessing the missing parts) I do *not* get that error. – Mark Byers Dec 22 '12 at 20:31
  • Well, that was the error. Once I added 'wb' to the call that opens the file, the write worked, but the characters are wrong (garbage instead of Hebrew text). – SagiLow Dec 22 '12 at 20:51
  • Note that the line "sys.stdout = codecs.getwriter('utf8')(sys.stdout)" should be replaced with "sys.stdout = codecs.getwriter('utf8')(sys.stdout.detach())" to work well with python 3.x. See http://stackoverflow.com/a/4374457/1825043. – Christian O'Reilly Jun 16 '16 at 08:58

2 Answers

In Python 3 you can only encode a string (`str`). If you currently have bytes instead, it's because your method is returning them that way. If you read the bytes from a file, for example, you should decode them into a string as soon as possible. Only then can you encode them to UTF-8.
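
As a minimal sketch of that decode-then-encode flow (the byte string here is hypothetical sample data, the UTF-8 encoding of the Hebrew letter alef, and is not taken from the asker's code):

```python
# Hypothetical sample data: UTF-8 bytes of the Hebrew letter alef (U+05D0).
raw = b"\xd7\x90"

# Decode bytes into a str as early as possible.
text = raw.decode("utf-8")

# Encode back to bytes only at the output boundary.
encoded = text.encode("utf-8")

# Print only the type names so the example does not depend on the console encoding.
print(type(text).__name__, type(encoded).__name__)
```

Calling `.encode()` on a `bytes` object raises `AttributeError` in Python 3, which is one sign that a value was never decoded.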

loopbackbee
  • How can I return the Hebrew characters as a string and not as bytes? – SagiLow Dec 22 '12 at 20:51
  • @SagiLow in my python3 interpreter, if I enter 'א', it actually returns a string. If you have bytes (which you will, if you're reading the file as binary), you just need to do bytes.decode('utf-8') (or whatever encoding you're using) – loopbackbee Dec 22 '12 at 22:41
  • I don't think it has anything to do with the way I read the file, since I'm trying to print a simple string variable that was created by the methods I mentioned above. – SagiLow Dec 24 '12 at 07:56
  • @SagiLow Does a simple `'א'.encode('utf-8')` in the Python 3 shell fail? – loopbackbee Dec 24 '12 at 14:09
  • The string is built just fine; the problem is with the "Writer". If I use 'print(currentWord.m_wordHeb)' that works just fine, but if I use 'outputFile.write((currentWord.m_wordHeb))' I get an error! – SagiLow Dec 24 '12 at 19:35
  • @SagiLow It would be worthwhile to add a small snippet to your question that we can **run** and that reproduces your problem. What is `output`/`outputFile`? A file, a stream? Are you just writing to `stdout`? If so, isn't `print` sufficient? – loopbackbee Dec 24 '12 at 20:54
  • I open the output file with `outputFile = open('C:\\X\\output.txt', 'w')`, then I create the str as I specified and try to write it to the file: `outputFile.write((currentWord.m_wordHeb))` – SagiLow Dec 26 '12 at 08:27

The problem is solved: the file I opened for writing wasn't opened with UTF-8 encoding. When I changed the open call to the following:

codecs.open("C:\\NLP\\output.txt", "w", "utf-8" )

everything seems to work out.
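
For reference, in Python 3 the built-in `open` also accepts an `encoding` argument, so the same fix can likely be written without `codecs`. The sketch below is an illustration under assumptions, not the code from this post: the Hebrew sample word and the temporary file path are hypothetical stand-ins.

```python
import os
import tempfile

# Hypothetical sample text: a Hebrew word written with escapes (shin, lamed, vav, final mem).
word = "\u05e9\u05dc\u05d5\u05dd"

# A temporary path stands in for the real output file.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# An explicit encoding avoids the platform default (often a 'charmap' codec on Windows).
with open(path, "w", encoding="utf-8") as output_file:
    output_file.write(word)

# Read the file back with the same encoding to confirm the round trip.
with open(path, "r", encoding="utf-8") as input_file:
    round_trip = input_file.read()

print(round_trip == word)
```

Without the `encoding` argument, a text-mode file uses the locale's preferred encoding, which is consistent with the 'charmap' error shown in the question.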

Thank you!

SagiLow