I am trying to open, print, and read a text file that contains special characters such as §. Below is the code I am running:

    import codecs
    f = codecs.open('sample_text.txt', mode='r', encoding='utf_8')
    print f.readline()

The first two lines work, but the third does not. The error message is:

    Traceback (most recent call last):
      File "C:\Users\mallikk\Documents\Python Scripts\special_char_test.py", line 6, in <module>
        print f.readline()
      File "C:\Anaconda2\lib\codecs.py", line 690, in readline
        return self.reader.readline(size)
      File "C:\Anaconda2\lib\codecs.py", line 545, in readline
        data = self.read(readsize, firstline=True)
      File "C:\Anaconda2\lib\codecs.py", line 492, in read
        newchars, decodedbytes = self.decode(data, self.errors)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa7 in position 13: invalid start byte

Any ideas? Please let me know if I can clarify anything or add more details. Thank you so much!

Shivani
  • This file is not encoded in UTF-8. Find the actual encoding and use that. – user2357112 Jun 23 '16 at 16:42
  • I don't think that 0xa7 is valid utf8. Are you sure it's in utf-8? Also why are you using codecs and not `open`? – syntonym Jun 23 '16 at 16:47
  • http://stackoverflow.com/questions/4255305/how-to-determine-encoding-table-of-a-text-file – stark Jun 23 '16 at 16:53
  • @user2357112 It was not in utf-8. I changed it in Notepad++. Thanks for the help! – Shivani Jun 23 '16 at 16:57
  • @syntonym I was under the impression that to deal with special characters like §, I would need to use codecs – Shivani Jun 23 '16 at 16:57
  • @Shivani [This question](http://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python) discusses codecs.open vs builtin open and io.open. Looks like you are right in python2 while in python3 `open` is preferred. – syntonym Jun 23 '16 at 17:07

2 Answers

To expand on what the commenters said, you need to find out the encoding of your file. The easiest way I know to do that is to:

  1. Open the file in Firefox.
  2. Right-click on the page and select "View Page Info".
  3. See what the "Text Encoding" is.
  4. Check the codecs documentation for the matching codec name, and use it instead of `utf_8` in your `f = codecs.open(...)` line.

Screenshot of steps 1–3:

screenshot
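
If a browser is not handy, a rough alternative is to try a few candidate codecs on the raw bytes and see which ones decode without error. The sketch below is only an illustration: `first_working_encoding` and the candidate list are made up for this example, and a successful decode does not prove the guess is right (`latin_1`, for instance, accepts any byte sequence), so check the decoded text by eye.

```python
# Hypothetical helper, not part of the codecs module: returns the first
# candidate encoding that decodes the bytes without raising an error.
def first_working_encoding(raw_bytes, candidates=("utf_8", "cp1252", "latin_1")):
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the data, try the next one
        return name
    return None  # none of the candidates worked


# Stand-in for the file's contents: byte 0xa7 is "§" in cp1252.
sample = b"Section \xa7 1"
print(first_working_encoding(sample))  # cp1252
```

For a real file, read the bytes with `open('sample_text.txt', 'rb').read()` and pass them to the helper.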

cxw

It looks like you are on a Windows machine, where the text file's encoding may well be something other than UTF-8. Try decoding the bytes with cp1252 or ISO-8859-1 (byte 0xa7 is § in both), and then encode the result as UTF-8 if you need UTF-8 output.
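
A minimal sketch of that decode-then-re-encode step, assuming the file really is cp1252. The helper name `reencode_to_utf8` and its parameters are invented for illustration; `io.open` is available in Python 2.6+ and Python 3.

```python
import io


# Hypothetical helper: the default cp1252 source encoding is a guess,
# not something confirmed about the asker's file.
def reencode_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    # Decode the source bytes into unicode text with the guessed encoding.
    with io.open(src_path, mode="r", encoding=src_encoding) as src:
        text = src.read()
    # Write the same text back out, this time encoded as UTF-8.
    with io.open(dst_path, mode="w", encoding="utf_8") as dst:
        dst.write(text)
    return text
```

After something like `reencode_to_utf8('sample_text.txt', 'sample_text_utf8.txt')` (the second file name is made up), the original `codecs.open(..., encoding='utf_8')` call should be able to read the new file.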

For best-practice advice on reading files, see [Difference between open and codecs.open in Python](http://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python).
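
As a rough illustration, here is the same first-line read done with `io.open`, which exists in Python 2.6+ and is the same function as the builtin `open` in Python 3. The function name `read_first_line` is hypothetical, and the encoding you pass in is whatever you determined for your file.

```python
import io


# Hypothetical helper: reads and returns the first line of a text file,
# decoded with the encoding the caller supplies.
def read_first_line(path, encoding):
    # The with-block closes the file even if decoding raises an error.
    with io.open(path, mode="r", encoding=encoding) as f:
        return f.readline()
```

For a cp1252 file this would be called as `read_first_line('sample_text.txt', 'cp1252')`.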

Stanley Kirdey