I have a text file which the publisher (the US Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing its lines with the following code:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = f.readline().strip().split('\t')
        for line in f.readlines():
            yield process_tag_record(fields, line)

I receive the following error:

Traceback (most recent call last):
  File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
    main()
  File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
    all_tags = list(tags("tag.txt"))
  File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
    content = f.read()
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
    return self.reader.read(size)
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte

Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?

What have I tried

I did a hexdump of the file and found the offending byte in the text "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I interpret the offending byte as a Unicode code point (i.e. U+00AD), it makes sense in context: it's a soft hyphen. But the following does not seem to work:

Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x41".decode("utf-8")
'A'
>>> b"\xad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
>>> b"\xc2ad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

I've used errors='replace', which seems to pass, but I'd like to understand what will happen if I try to insert that data into a database.
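
For example, decoding the offending byte with errors='replace' substitutes the U+FFFD replacement character, which is presumably what would end up in the database:

>>> b"NON\xadCASH".decode("utf-8", errors="replace")
'NON�CASH'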

Hexdump:

0036ae40  31 09 09 09 09 53 55 50  50 4c 45 4d 45 4e 54 41  |1....SUPPLEMENTA|
0036ae50  4c 20 44 49 53 43 4c 4f  53 55 52 45 20 4f 46 20  |L DISCLOSURE OF |
0036ae60  4e 4f 4e ad 43 41 53 48  20 49 4e 56 45 53 54 49  |NON.CASH INVESTI|
0036ae70  4e 47 20 41 4e 44 20 46  49 4e 41 4e 43 49 4e 47  |NG AND FINANCING|
0036ae80  20 41 43 54 49 56 49 54  49 45 53 3a 09 0a 50 72  | ACTIVITIES:..Pr|
    In Python 3.6, do not use `codecs.open()`. The standard `open()` function can handle encoded data better and faster. – Martijn Pieters Sep 12 '17 at 15:48
  • What does the actual hexdump show? *In hexadecimal*, not as ASCII plus replacement character. The U+00AD byte would be encoded as two bytes, 0xC2 0xAD, and so you are missing the 0xC2 byte. – Martijn Pieters Sep 12 '17 at 15:50
  • How did you obtain the data file? If there is a byte missing there, there could be other data corruption too. – Martijn Pieters Sep 12 '17 at 15:51
  • Next, I'd not use `readline()` and `readlines()` calls either; use `fields = next(f).strip().split('\t')` and `for line in f:`. That avoids reading the whole file into memory at once before processing each line. – Martijn Pieters Sep 12 '17 at 15:54
  • Good advice, thanks Martijn. Data file is obtained by downloading and extracting the following zip file: https://www.sec.gov/files/dera/data/financial-statement-data-sets/2017q2.zip. – MikeRand Sep 12 '17 at 16:05
  • I too get a decoding error, just in a *different location*. – Martijn Pieters Sep 12 '17 at 16:13
  • The data there has `pertaining to Hotel Kranichh\xf6he.` in binary, indicating that the data is really encoded as Latin-1 instead. – Martijn Pieters Sep 12 '17 at 16:15
  • Apologies, I was using the following file: https://www.sec.gov/files/dera/data/financial-statement-data-sets/2017q1.zip – MikeRand Sep 12 '17 at 16:24
  • Actually, the same string appears in that one too, later on in the file. – Martijn Pieters Sep 12 '17 at 16:27
  • The file is a bit rough. There are also information separators 1C and 1D that seem to be causing line-break issues later in the file. That's my next issue to handle. – MikeRand Sep 12 '17 at 16:30
  • Those are probably meant to be fancy quotes, U+201C and U+201D! The plot thickens. Not Latin-1 then either. – Martijn Pieters Sep 12 '17 at 16:35
  • It is clear the file is **not** UTF-8 encoded, the SEC has messed up their encoding here. – Martijn Pieters Sep 12 '17 at 16:48
  • Yes, in context, fancy quotes make sense (given that it is an abbreviation of a longer term). Is there a way to loop over all encodings supported by Python where U+201C is encoded into 0x1c? I tried to find it in `codecs` but couldn't see where I could loop over the codec registry. – MikeRand Sep 12 '17 at 16:53
  • There are libraries that can determine the actual encoding for documents with quite high (but not 100%) reliability. Check out this question: https://stackoverflow.com/questions/436220/determine-the-encoding-of-text-in-python – Håken Lid Sep 12 '17 at 16:54
  • @MikeRand: I just tried, there is no such codec; a sketch of that brute-force check follows these comments. – Martijn Pieters Sep 12 '17 at 16:57
  • @HåkenLid: except there is no known encoding that can produce the output the SEC produced. They have produced an invalid encoding. – Martijn Pieters Sep 12 '17 at 16:57
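
One way to run that brute-force check is to loop over Python's bundled codec aliases and test each one (a sketch; encodings.aliases is not a formal registry, but it covers the codecs that ship with Python). As noted above, it finds nothing:

from encodings.aliases import aliases

# Try every codec that ships with Python and report any that encode
# U+201C as the single byte 0x1C.
for codec in sorted(set(aliases.values())):
    try:
        if '\u201c'.encode(codec) == b'\x1c':
            print(codec)
    except (UnicodeError, LookupError):
        pass  # codec can't represent the character, or isn't a text codec
# ... prints nothing: no bundled codec maps U+201C to 0x1C.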

1 Answer

You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:

>>> '\u00ad'.encode('utf8')
b'\xc2\xad'

Of all the possible UTF-8 sequences that end in 0xAD, a soft hyphen makes the most sense. However, it is indicative of a data set that may have other bytes missing; you just happened to hit one that matters.
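
For illustration, these are all the characters a two-byte UTF-8 sequence ending in 0xAD could have encoded; in all-caps English text like this, the soft hyphen is the only plausible candidate:

import unicodedata

# List every character whose two-byte UTF-8 encoding ends in the
# continuation byte 0xAD; 0xC2-0xDF is the valid two-byte lead range.
for lead in range(0xC2, 0xE0):
    char = bytes([lead, 0xAD]).decode('utf8')
    print(f'0x{lead:02X} 0xAD -> U+{ord(char):04X} {unicodedata.name(char, "<unnamed>")}')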

I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.

Another possibility is that the SEC is really using a different encoding for the file; for example, in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, it is a large ZIP file) and open tag.txt, I can't decode the data as UTF-8:

>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')

There are just two lines in the file that contain non-ASCII bytes:

>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']

Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
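
Decoding those bytes as Latin-1 confirms it:

>>> b'Hotel Kranichh\xf6he'.decode('latin-1')
'Hotel Kranichhöhe'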

There are also several 0x1C / 0x1D pairs in the file:

>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'

I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encoding to UTF-8 properly!
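
The arithmetic supports that: in big-endian UTF-16 the curly quotes encode to the byte pairs 0x20 0x1C and 0x20 0x1D, so stripping the 0x20 high bytes leaves exactly the 0x1C and 0x1D seen in the file:

>>> '\u201c\u201d'.encode('utf-16-be')
b' \x1c \x1d'

(The two spaces in that byte string are the 0x20 high bytes.)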

There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
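
A quick scan (a sketch, reusing the file path from above) tallies every control byte other than tab and newline, and it turns up exactly the five values identified so far:

from collections import Counter

# Count all control bytes except tab (0x09) and newline (0x0A).
with open('/tmp/2017q1/tag.txt', 'rb') as f:
    counts = Counter(b for line in f for b in line
                     if b < 0x20 and b not in (0x09, 0x0A))
print(sorted(f'0x{b:02X}' for b in counts))
# ['0x13', '0x14', '0x19', '0x1C', '0x1D']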

If we assume that the encoding is broken, we can attempt a repair. The following code reads the file as Latin-1 and fixes up the mis-encoded punctuation, assuming that the rest of the data does not use characters outside of Latin-1:

_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}
def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)

Then apply that to the lines you read:

with open(filename, 'r', encoding='latin-1') as f:
    repaired = map(repair, f)
    fields = next(repaired).strip().split('\t')
    for line in repaired:
        yield process_tag_record(fields, line)

Separately, addressing your posted code: you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code with known issues, and it is slower than the newer Python 3 I/O layer. Just use open(). And don't use f.readlines(); you don't need to read the whole file into a list here. Iterate over the file directly:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)

If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:

import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)

If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # first row is used as keys for the dictionary, no need to read fields manually.
        yield from reader
  • I have to do a bit more work in process_tag_record than just zipping and returning (e.g. converting data to Python data types, creating a SQLAlchemy instance), but yes, that would work better if it were just a zip and return. – MikeRand Sep 12 '17 at 16:11
  • Per your "UTF-16 with high bytes stripped": that's exactly what it looks like. There are also single quotes and em and en dashes that follow the same pattern. – Mark Tolonen Sep 12 '17 at 16:59
  • @MarkTolonen: indeed; I found 5 different 'anomalies', bytes between 0x00 and 0x1F (excluding 0x09 and 0x0A, i.e. tab and newline). – Martijn Pieters Sep 12 '17 at 17:20
  • In the general case, if you have a single byte and at least a vague hypothesis of what it might represent, https://cdn.rawgit.com/tripleee/8bit/master/encodings.html allows you to look up the possible values in the various legacy 8-bit encodings known to Python. – tripleee Jan 17 '19 at 09:16
  • @tripleee: interesting. I generally use the fileformat.info [characterset](http://www.fileformat.info/info/charset/) and [unicode](https://www.fileformat.info/info/unicode/) pages to cross-reference characters; they have [comprehensive, per-codepoint listings of character sets](https://www.fileformat.info/info/unicode/char/20ac/charset_support.htm) and [Windows codepages](https://www.fileformat.info/info/unicode/char/20ac/codepage_support.htm) to check against. And for other broken encoding problems the [`ftfy` project](https://ftfy.readthedocs.io/en/latest/) is invaluable. – Martijn Pieters Jan 17 '19 at 11:25
  • @tripleee: and I only now noticed that your links go to fileformat.info :-D – Martijn Pieters Jan 17 '19 at 11:32
  • @tripleee: thanks again for that page, it [was helpful in finding a weird codec once more](https://stackoverflow.com/a/54631457/100297). – Martijn Pieters Feb 11 '19 at 13:16
  • I can't fix the broken link, but I can point to the new location: https://tripleee.github.io/8bit – tripleee Mar 02 '21 at 14:21