
I am trying to read some Unicode files that I have locally. How do I read Unicode files while using a list? I've read the Python docs and a ton of Stack Overflow Q&As, which answered a lot of other questions I had, but I can't find the answer to this one.

Any help is appreciated.

Edit: Sorry, my files are in utf-8.

  • What is your current code? – BrenBarn Dec 31 '13 at 07:16
  • There is no such thing as "a Unicode file". There are several *encodings* that can be used to encode Unicode strings into bytes, the most common of which is `utf-8`. Is that the encoding of your files? If not, which one is? Do your files have a [BOM (Byte Order Mark)](http://en.wikipedia.org/wiki/Byte_order_mark)? – Tim Pietzcker Dec 31 '13 at 07:18
  • Yes, my files are in UTF-8. – user3148596 Dec 31 '13 at 07:28

1 Answer


You can open UTF-8-encoded files like this:

import codecs

# "utf-8-sig" decodes UTF-8 and skips a leading BOM if one is present
with codecs.open("myutf8file.txt", encoding="utf-8-sig") as infile:
    for line in infile:
        pass  # do something with line

Be aware that codecs.open() does not translate \r\n to \n, so if you're working with Windows files, you need to take that into account.
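
One possible way to take the line endings into account is to open the file with io.open() instead, which in text mode translates \r\n to \n by default. The following is only a sketch: the helper name read_lines and the sample file are made up for illustration, and collecting the lines into a list is an assumption about what the question means by "using a list".

```python
import io
import os
import tempfile


def read_lines(path):
    """Return the lines of a UTF-8 file as a list of strings (hypothetical helper)."""
    # io.open() in text mode translates \r\n to \n by default,
    # so only "\n" has to be stripped from each line.
    with io.open(path, encoding="utf-8-sig") as infile:
        return [line.rstrip("\n") for line in infile]


# Hypothetical sample file with Windows line endings, written as raw bytes.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"caf\u00e9\r\nna\u00efve\r\n".encode("utf-8"))

lines = read_lines(sample_path)
os.remove(sample_path)
print(lines)
```

If you stay with codecs.open(), stripping with line.rstrip("\r\n") should have a similar effect, since the \r is then still part of each line.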

The utf-8-sig codec will read UTF-8 files with or without a BOM (Byte Order Mark) (and strip it if it's there). On writing, you should use utf-8 as a codec because the Unicode standard recommends against writing a BOM in UTF-8 files.
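
To illustrate the difference between the two codecs, here is a small sketch that reads the same text with and without a BOM and then writes it back with utf-8. The file names and the read_text helper are invented for this example and are not part of any library.

```python
import codecs
import io
import os
import shutil
import tempfile

text = u"h\u00e9llo\n"
tmpdir = tempfile.mkdtemp()
with_bom = os.path.join(tmpdir, "with_bom.txt")        # hypothetical file names
without_bom = os.path.join(tmpdir, "without_bom.txt")
out_path = os.path.join(tmpdir, "out.txt")

# Create one input file that starts with a BOM and one that does not.
with io.open(with_bom, "wb") as f:
    f.write(codecs.BOM_UTF8 + text.encode("utf-8"))
with io.open(without_bom, "wb") as f:
    f.write(text.encode("utf-8"))


def read_text(path, encoding):
    """Read a whole file as text using the given codec (hypothetical helper)."""
    with io.open(path, encoding=encoding) as f:
        return f.read()


sig_with = read_text(with_bom, "utf-8-sig")        # BOM is stripped
sig_without = read_text(without_bom, "utf-8-sig")  # no BOM, read as-is
plain_with = read_text(with_bom, "utf-8")          # BOM survives as U+FEFF

# Writing with "utf-8" does not emit a BOM.
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(text)
with io.open(out_path, "rb") as f:
    raw = f.read()

shutil.rmtree(tmpdir)
```

Here sig_with and sig_without end up identical, while plain_with keeps the BOM as a leading U+FEFF character, which is why utf-8-sig is the more forgiving choice for reading.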

Tim Pietzcker
  • It's fairly easy to ignore any UTF-8 BOM that might be at beginning of the input file with an `if infile.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8: infile.seek(0)` following the `with` statement. – martineau Dec 31 '13 at 08:46
  • @martineau: It's probably easier to use the `utf-8-sig` codec for this (but you shouldn't use it for writing, therefore I hadn't included it in my answer). – Tim Pietzcker Dec 31 '13 at 10:19
  • Then it's even easier than I thought. You seem overly concerned about writing files considering it's not even mentioned in the OP's question. – martineau Dec 31 '13 at 11:16
  • @martineau: I guess you're right. Well, at one point or another *something* will be output by a program, and my guess is that this will have something to do with the files the program has read. I have edited my answer a bit. – Tim Pietzcker Dec 31 '13 at 11:39
  • Thank you for your help, Tim Pietzcker and @martineau. – user3148596 Dec 31 '13 at 19:16