Read a text file with non-ASCII characters in an unknown encoding

Question

I want to read a file that contains German characters, and not only those. I found that I can do it like this:

  >>> import codecs
  >>> file = codecs.open('file.txt','r', encoding='UTF-8')
  >>> lines= file.readlines()

This works when I run my script in Python IDLE, but when I run it from anywhere else it does not give the correct result. Any ideas?

It depends on what encoding the file was saved with. iso8859-1 is probably a good guess if it's not UTF-8. — Wooble, Jun 18 '12 at 16:10
@indiag, [`sys.version`](http://docs.python.org/library/sys.html#sys.version) or [`sys.version_info`](http://docs.python.org/library/sys.html#sys.version_info). — Andrew Clark, Jun 18 '12 at 16:14
Again it is not working with iso8859-1. I have the characters ö,ü,ä,ß — indiag, Jun 18 '12 at 16:14
ok thanks. This is my version 3.1 (r31:73574, Jun 26 2009, 17:50:52) [MSC v.1500 64 bit (AMD64)] — indiag, Jun 18 '12 at 16:16
@indiag, Try reading the file in binary mode using `open('file.txt', 'rb').readlines()`, and then use `print(repr(line))` for a line that you know contains the German characters, as well as what you expect it to be. This should help us determine what the encoding is. — Andrew Clark, Jun 18 '12 at 16:19
Sorry, it does not contain only German characters. There is also a name like Božović. It is like a phone book. — indiag, Jun 18 '12 at 16:20
@F.J It is still not working. I'll post part of the text file. — indiag, Jun 18 '12 at 16:22
Sorry guys, I don't know why, but now it is working: `lines = codecs.open('fbc_math.txt','r', encoding='UTF-8').readlines()` — indiag, Jun 18 '12 at 16:26
@indiag, it just occurred to me that `readlines()` probably doesn't work in binary mode, try `print(repr(open('file.txt', 'rb').read()))`, and then post all or a portion of the output. — Andrew Clark, Jun 18 '12 at 16:27
@F.J What you suggest gives me strange results. I don't know why. — indiag, Jun 18 '12 at 16:31
@indiag If you found the solution to your problem, it would be better to post it as an answer, not as an edit to your question. Post it as an answer and accept it. — brandizzi, Jun 18 '12 at 16:36
@brandizzi OK, I'll do it. But this works only from Python IDLE. I'll change the question. — indiag, Jun 18 '12 at 16:44
You need to better define what you mean by "not working". Is it giving an error, or are the wrong characters being displayed? — Mark Ransom, Jun 18 '12 at 16:50
@MarkRansom When I run the program on Linux, it prints the special characters in a strange form, e.g. BoΕΎoviΔ, Nemanja, and when I run it on Windows from cmd, it gives the message 'return codecs.charmap_encode(input,self.errors,encoding_map)' — indiag, Jun 18 '12 at 16:55
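
The two symptoms in the last comment can be reproduced in isolation, which may help narrow things down. This is a minimal sketch, assuming the file really is UTF-8, that the Linux terminal was interpreting the bytes as ISO 8859-7 (Greek), and that the Windows console was using code page 437. Both code pages are guesses and are not stated anywhere in the thread.

```python
name = "Božović"

# The bytes a UTF-8 file would contain for this name.
raw = name.encode("utf-8")
print(repr(raw))

# Decoding those bytes with the wrong codec gives mojibake, not an error.
# ISO 8859-7 is an assumption, chosen because the garbled output in the
# comment contains Greek letters.
garbled = raw.decode("iso8859_7")
print(repr(garbled))

# Encoding for a console code page that lacks a character raises
# UnicodeEncodeError; the traceback passes through the charmap codec.
# cp437 is an assumed console code page.
try:
    name.encode("cp437")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```

If this reading is right, the file is being decoded correctly and the damage happens when the text is written to the terminal.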

score 23 · Answer 1 · edited May 26 '18 at 20:17

You need to know which character encoding the text is encoded in. If you don't know that beforehand, you can try guessing it with the chardet module. First install it:

$ pip install chardet

Then, for example, read the file in binary mode and pass the raw bytes to chardet.detect:

>>> import chardet
>>> chardet.detect(open("file.txt", "rb").read())
{'confidence': 0.9690625, 'encoding': 'utf-8'}

So then, open the file with the detected encoding:

>>> import codecs
>>> lines = codecs.open('file.txt', 'r', encoding='utf-8').readlines()
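
If installing chardet is not an option, a cruder alternative is to try a short list of candidate encodings in order. This is a different technique from statistical detection. It is a sketch under the assumption that the file is either UTF-8 or a Western European single-byte encoding. The helper name `decode_with_fallback` and the candidate list are illustrative and not part of any library.

```python
def decode_with_fallback(data, candidates=("utf-8-sig", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            # utf-8-sig also accepts plain UTF-8 and strips a leading BOM.
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails.
    # The result may be the wrong characters, but no bytes are lost.
    return data.decode("latin-1"), "latin-1"


# In-memory example; for a real file, pass open(path, "rb").read() instead.
sample = "Božović, Nemanja\n".encode("utf-8")
text, used = decode_with_fallback(sample)
print(repr(text), used)
```

The order of candidates matters: UTF-8 rejects most byte sequences that are not valid UTF-8, whereas single-byte encodings accept almost anything, so they should come last.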

answered Jun 18 '12 at 16:33 by Chewie · edited May 26 '18 at 20:17 by Michael Goldshteyn

You have to import codecs at the top of your file: `import codecs` – duhaime Oct 21 '16 at 17:32

score 0 · Answer 2 · answered Jun 18 '12 at 17:28

I believe the file is being read correctly, but the wrong encoding is being used when the text is output. This is based on the fact that you get the proper results in IDLE.

I would suggest trying print(line.encode('utf-8')), but I'm afraid I don't know whether Python 3 will print a bytes object properly.
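
In Python 3, `print()` on a bytes object shows its `b'...'` repr rather than the characters, so the suggestion above will not display the text. One way to control the output encoding instead is to write the encoded bytes to the underlying binary stream. This is a minimal sketch, assuming the terminal really expects UTF-8; `write_utf8` is a hypothetical helper name.

```python
import io
import sys


def write_utf8(line, stream=None):
    """Encode line as UTF-8 and write the bytes to a binary stream."""
    if stream is None:
        # sys.stdout.buffer is the binary layer underneath text-mode stdout.
        stream = sys.stdout.buffer
    stream.write(line.encode("utf-8"))
    stream.flush()


# Demonstration against an in-memory stream so the bytes can be inspected.
captured = io.BytesIO()
write_utf8("Božović, Nemanja\n", captured)
```

Calling `write_utf8(line)` with no second argument writes to the real standard output. Setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python is another way to get a similar effect with plain `print()`.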
