
I want to read a file that contains German characters, not only plain ASCII ones. I found that I can do it like this:

  >>> import codecs
  >>> file = codecs.open('file.txt', 'r', encoding='UTF-8')
  >>> lines = file.readlines()

This works when I run my job in Python IDLE, but when I run it from somewhere else it does not give the correct result. Any ideas?

indiag
  • What version of python are you using? – Jeff Jun 18 '12 at 16:10
  • It depends on what encoding the file was saved with. iso8859-1 is probably a good guess if it's not UTF-8. – Wooble Jun 18 '12 at 16:10
  • Python 3.1. How do we actually check which version we are using? – indiag Jun 18 '12 at 16:11
  • @indiag, [`sys.version`](http://docs.python.org/library/sys.html#sys.version) or [`sys.version_info`](http://docs.python.org/library/sys.html#sys.version_info). – Andrew Clark Jun 18 '12 at 16:14
  • It still does not work with iso8859-1. The file contains the characters ö, ü, ä, ß. – indiag Jun 18 '12 at 16:14
  • ok thanks. This is my version 3.1 (r31:73574, Jun 26 2009, 17:50:52) [MSC v.1500 64 bit (AMD64)] – indiag Jun 18 '12 at 16:16
  • @indiag, Try reading the file in binary mode using `open('file.txt', 'rb').readlines()`, and then use `print(repr(line))` for a line that you know contains the German characters. Post that output along with what you expect it to be. This should help us determine what the encoding is. – Andrew Clark Jun 18 '12 at 16:19
  • Sorry, it does not contain only German characters. There is also a name, Božović. It is like a phone book. – indiag Jun 18 '12 at 16:20
  • @F.J It still does not work. I'll post part of the text file. – indiag Jun 18 '12 at 16:22
  • Sorry guys, I don't know why, but now it is working: lines = codecs.open('fbc_math.txt','r', encoding='UTF-8').readlines() – indiag Jun 18 '12 at 16:26
  • @indiag, it just occurred to me that `readlines()` probably doesn't work in binary mode, try `print(repr(open('file.txt', 'rb').read()))`, and then post all or a portion of the output. – Andrew Clark Jun 18 '12 at 16:27
  • @F.J What you suggest gives me strange results. I do not know. – indiag Jun 18 '12 at 16:31
  • @indiag If you found the solution to your problem, it would be better to post it as an answer, not as an edit to your question. Post it as an answer and accept it. – brandizzi Jun 18 '12 at 16:36
  • @brandizzi OK, I'll do it. But this works only from Python IDLE. I'll change the question. – indiag Jun 18 '12 at 16:44
  • You need to better define what you mean by "not working". Is it giving an error, or are the wrong characters being displayed? – Mark Ransom Jun 18 '12 at 16:50
  • @MarkRansom When I run the program on Linux, it prints these special characters in a strange form, e.g. BoΕΎoviΔ, Nemanja, and when I run it on Windows from cmd it gives a message containing 'return codecs.charmap_encode(input,self.errors,encoding_map)' – indiag Jun 18 '12 at 16:55

2 Answers


You need to know which character encoding the text is encoded in. If you don't know that beforehand, you can try guessing it with the chardet module. First install it:

$ pip install chardet

Then, for example, read the file in binary mode and pass the raw bytes to chardet:

>>> import chardet
>>> chardet.detect(open("file.txt", "rb").read())
{'confidence': 0.9690625, 'encoding': 'utf-8'}

So then:

>>> import codecs
>>> lines = codecs.open('file.txt', 'r', encoding='utf-8').readlines()
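
If installing chardet is not an option, a rough alternative is to try a short list of candidate encodings in order and keep the first one that decodes without error. The sketch below is only an illustration under assumptions: the function name `read_lines_with_fallback` and the candidate list are hypothetical, and a clean decode does not prove the guess is right (iso8859-1 in particular accepts any byte sequence, so it goes last).

```python
def read_lines_with_fallback(path, encodings=("utf-8", "cp1252", "iso8859-1")):
    """Return (lines, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate list is an assumption, not a rule.
    Unlike readlines(), the returned lines have no trailing newline.
    """
    # Read raw bytes first so no decoding happens implicitly.
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in encodings:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
        return text.splitlines(), name
    raise ValueError("none of the candidate encodings could decode %r" % path)
```

Calling `lines, used = read_lines_with_fallback('file.txt')` also tells you which encoding was picked, which can be compared against what chardet reports.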
Michael Goldshteyn
Chewie

I believe the file is being read correctly, but the wrong encoding is being used when the text is output. This is based on the fact that you get the proper results in IDLE.

I would suggest trying to use print(line.encode('utf-8')) but I'm afraid I don't know if Python 3 will print a bytes object properly.
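
For what it's worth, in Python 3 `print()` on a bytes object shows its repr (for example `b'\xc3\xb6'` for ö) rather than the character itself. One possible way around this, sketched here under the assumption that the terminal actually expects UTF-8, is to write the encoded bytes to the binary layer underneath the text stream. The helper name `write_utf8` is hypothetical.

```python
import sys


def write_utf8(line, stream=None):
    """Write `line` as UTF-8 bytes, bypassing the stream's own encoding.

    Hypothetical helper; assumes `stream` is a text stream that exposes a
    binary `.buffer` (as sys.stdout normally does) and that whatever
    displays the output interprets it as UTF-8.
    """
    stream = sys.stdout if stream is None else stream
    # Flush pending text first so the output order is preserved.
    stream.flush()
    stream.buffer.write(line.encode("utf-8") + b"\n")
    stream.buffer.flush()
```

Usage would be `write_utf8(line)` in place of `print(line)`. If the console uses a different code page (as the Windows `charmap_encode` message suggests), the bytes will still be displayed incorrectly there, so this only helps where the output side really is UTF-8.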

Mark Ransom