
I have to read a text file into Python. The file encoding is:

file -bi test.csv 
text/plain; charset=us-ascii

This is a third-party file, and I get a new one every day, so I would rather not change it. The file has non-ASCII characters, such as Ö, for example. I need to read the lines using Python, and I can afford to skip any line that contains a non-ASCII character.

My problem is that when I read the file in Python, I get a UnicodeDecodeError as soon as a line with a non-ASCII character is reached, and I cannot read the rest of the file.

Is there a way to avoid this? If I try this:

import codecs

fileHandle = codecs.open("test.csv", encoding='utf-8')
try:
    for line in fileHandle:
        print(line, end="")
except UnicodeDecodeError:
    pass

then when the error is raised the for loop ends and I cannot read the remainder of the file. I want to skip the line that causes the error and carry on. I would rather not make any changes to the input file, if possible.

Is there any way to do this? Thank you very much.

asked by Chicoscience; edited by Martijn Pieters
  • Why are you using `codecs.open()` in Python 3? `open()` handles UTF-8 **just fine**. – Martijn Pieters Jul 07 '14 at 17:51
  • I also tried using open, I get the same error – Chicoscience Jul 07 '14 at 17:52
  • Do you know what encoding the file is really using? It's clearly not `us-ascii` as shown by the `file` output, since it contains non-ascii characters. – dano Jul 07 '14 at 18:08
  • @Chicoscience: I wasn't addressing your problem; I was puzzled as to why you were using `codecs.open()` here, as it is inferior to `open()`. – Martijn Pieters Jul 07 '14 at 18:11
  • Not a problem, Martijn, thanks! Dano, that is strange to me as well, the encoding says ascii but it is clearly not ascii – Chicoscience Jul 08 '14 at 11:42
  • See also: [set the implicit default encoding\decoding error handling in python](https://stackoverflow.com/questions/3363339/set-the-implicit-default-encoding-decoding-error-handling-in-python) – Evandro Coan Apr 27 '19 at 02:25

1 Answer


Your file doesn't appear to use the UTF-8 encoding. It is important to use the correct codec when opening a file.

You can tell open() how to treat decoding errors, with the errors keyword:

errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

  • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
  • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
  • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
  • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
  • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
  • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

Opening the file with anything other than 'strict' ('ignore', 'replace', etc.) will then let you read the file without exceptions being raised.
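
For example, a minimal sketch (reusing the test.csv name from the question) that reads every line, substituting the Unicode replacement character for bytes that are not valid UTF-8:

# 'replace' turns undecodable bytes into U+FFFD (the replacement character)
# when decoding; use errors="ignore" instead to drop them silently.
with open("test.csv", encoding="utf8", errors="replace") as f:
    for line in f:
        print(line, end="")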

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:

import re

_surrogates = re.compile(r"[\uDC80-\uDCFF]")

def detect_decoding_errors_line(l, _s=_surrogates.finditer):
    """Return decoding errors in a line of text

    Works with text lines decoded with the surrogateescape
    error handler.

    Returns a list of (pos, byte) tuples

    """
    # DC80 - DCFF encode bad bytes 80-FF
    return [(m.start(), bytes([ord(m.group()) - 0xDC00]))
            for m in _s(l)]

E.g.

with open("test.csv", encoding="utf8", errors="surrogateescape") as f:
    for i, line in enumerate(f, 1):
        errors = detect_decoding_errors_line(line)
        if errors:
            print(f"Found errors on line {i}:")
            for (col, b) in errors:
                print(f" {col + 1:2d}: {b[0]:02x}")

Take into account that not all decoding errors can be recovered from gracefully. While UTF-8 is designed to be robust in the face of small errors, other multi-byte encodings such as UTF-16 and UTF-32 can't cope with dropped or extra bytes, which will then affect how accurately line separators can be located. The above approach can then result in the remainder of the file being treated as one long line, which for a big enough file can in turn lead to a MemoryError exception.

answered by Martijn Pieters
  • I tried to find an alternate solution by catching the decoding exceptions themselves. Unfortunately it appears (in Python 2 at least) that the decoding occurs *before* line endings are detected, so you don't get consistent results - you might lose more than one line, or you might get hung on the same buffer forever. – Mark Ransom Jul 07 '14 at 20:40
  • @MartijnPieters The issue with `ignore` is that it will ignore invalid characters and not the whole line...so, I'd like to use `strict` and catch the Exception, to do finer-grained error handling. But like OP, I can't figure out how to do this with the for loop... – flow2k May 28 '19 at 23:09
  • @flow2k You can’t because decoding is done per block of file data, not per line. There is a work-around: use an error handler that replaces erroneous characters then look for the replacements in each line read. If you use `surrogateescape` as the error handler you can even recover the problematic bytes. I’ve added example code to the answer. – Martijn Pieters May 28 '19 at 23:32
  • @MarkRansom same idea for you, albeit 5 years late. – Martijn Pieters May 28 '19 at 23:33
  • @MartijnPieters Aha! This does it. But I don't understand one thing: if Python doesn't decode by line, then how do `ignore` and `surrogateescape` work? Don't they furnish one line at a time? – flow2k May 29 '19 at 01:15
  • @flow2k No, they operate on the same block when decoding. An encoding has no special knowledge of line delimiters. Multi-byte codecs (UTF-16 & UTF-32 specifically) encode newline characters using more than one byte, which means you can’t split text into lines without decoding first. I am not sure where the confusion lies here? – Martijn Pieters May 29 '19 at 01:49
  • @MartijnPieters I understand what you're saying about the decoding - a block must be decoded to find the newlines first. What I don't see, is why Python doesn't provide an API which lets the dev catch the `DecodeError` by line? It already does this split by newline delimiter for `ignore` and `surrogateescape`, so why not do it for `strict` error handling, too? – flow2k May 29 '19 at 06:26
  • To be more concrete, take this snippet: https://gist.github.com/flow2k/8bd4fece21fa1a0b75737a3d9fc2e86c. I'm using `readline()` here to try to catch the exception by line. But I found it doesn't work: when the exception is thrown, the rest of the block is skipped and next readline() returns the next block. Python could have set the seek position to the previous newline delimiter in the original block so there is no skipping, but somehow it wasn't designed this way. – flow2k May 29 '19 at 06:26
  • @flow2k: newline detection is a completely separate task done after decoding. There is no special handling in the error handlers for this. Either decoding the block succeeds or it fails, and if it succeeds a later stage can detect line separators and produce individual string objects for each line. All that a different error handler does is influence how bad data is handled when decoding. – Martijn Pieters May 29 '19 at 08:47
  • @flow2k: also, it is entirely possible to corrupt the remainder of your input stream by dropping or inserting invalid bytes. That means it is *impossible* to know anything about line separator characters and so about lines, to attribute errors to. – Martijn Pieters May 29 '19 at 08:50
  • @flow2k last but not least: file data and other streams have *no concept of lines*, only of a sequence of bytes. Only when you interpret those bytes (assign meaning to them via a codec) can you start to designate some of those bytes (or a specific sequence of bytes) as a line separator, and everything between the line separators as lines. That all means that without decoding, there are no lines. If decoding fails, you can’t say, with 100% accuracy for all inputs, what line an error applies to. – Martijn Pieters May 29 '19 at 08:54
  • @flow2k what the surrogateescape approach gives you is that you say to the decoder: _please soldier on, give me the bad data wrapped up in special codepoints, **and hope for the best**. We’ll just trust that what comes after isn’t too badly corrupted and we can pretend that line separators are still line separators._ – Martijn Pieters May 29 '19 at 08:57
  • @MartijnPieters What you are saying makes sense. But why can't we also say to the `strict` decoder: *you see bad data, okay, but please soldier on until you see what appears to be a line separator, and then set the seek position there. After you've done that, throw an Exception. I wanted you to set the seek position because the next call to readline() can start immediately after the line separator*. – flow2k May 30 '19 at 00:30
  • @flow2k: this is going round in circles now. No, you can't say that to a decoder because decoders have no knowledge of line separators. The error happens in a *block of bytes*, it can be before a line separator or after. There can be many line separators or zero. The decoder should not care nor can it. You can't ask a decoder to continue to a next line separator because there might not *be* a next line. Decoders are engineered to also work on streaming data (say, from a network connection), and so don't know how much data is still to follow, or when it'll be available. – Martijn Pieters May 31 '19 at 15:35