
I have a file with an old format from the 70s used in Companies House (UK company registry).

I inherited a parser written 6 years ago which goes line by line and according to a set of conditions extracts the information from the line and inserts them into a dictionary.

There is a weird character that is breaking a line.

I copied this line to a new file with `awk '{if(NR==33411) print $0}' PROD216_1950_ew_1.dat > broken` and opened `broken` in vim.

It turns out that vim displays the weird character as <85>.

The result is that everything after MAYFIELD is read as a new line.

Below is the line in question:

000376702103032986930001        1993010119941024        193709          0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD 3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<

in vim becomes

000376702103032986930001        1993010119941024        193709          0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD <85>3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<

I am using codecs to read this file with a context manager (code below), which I thought was the right way to go about it.

Is there anything I am missing? What is that <85>?

import codecs

with codecs.open(filepath, 'r', 'utf-8') as fh:
    for line in fh:
        linetype = determine_line_type(line)
        if linetype == 'header':
            continue
        elif linetype == 'company':
            pass  # do stuff...
        elif linetype == 'officer':
            pass  # do stuff...
Tytire Recubans
  • An alternative to using vim would be `od -c broken` – Michael Dyck Nov 07 '19 at 14:07
  • I suspect that what you actually have is a `cp1252`-encoded file that is being run through a `latin1`-to-`utf8` conversion process before it gets to you. That is, in true `cp1252` there was a byte 0x85, then after the `latin1`-to-`utf8` process there was the two-byte sequence 0xC2 0x85, which `vim` interpreted as the character `U+0085`. – Daniel Martin Nov 19 '19 at 16:35
  • Ooooh thanks. This makes a lot of sense. Thanks a lot. – Tytire Recubans Nov 19 '19 at 16:46
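
The conversion chain suspected in the comment above can be reproduced in a few lines. This is only a sketch: the byte string is a made-up fragment of the record, and it assumes the file really did go through a `latin1`-to-`utf8` step, which nobody has confirmed.

```python
# Hypothetical fragment of the record as it might have looked in true cp1252,
# where byte 0x85 is the ellipsis character.
original = b"MAYFIELD \x853"

# Decoding it as latin-1 by mistake turns 0x85 into the control character U+0085.
as_text = original.decode("latin-1")

# Re-encoding that text as UTF-8 produces the two-byte sequence 0xC2 0x85.
as_utf8 = as_text.encode("utf-8")
print(as_utf8)  # b'MAYFIELD \xc2\x853'

# Undoing the wrong step and decoding as cp1252 recovers the ellipsis.
repaired = as_utf8.decode("utf-8").encode("latin-1").decode("cp1252")
print(repaired)
```

If the guess is right, the last line prints the fragment with a real ellipsis (U+2026) where vim showed <85>.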

2 Answers

1

vim shows <85> to indicate a hex 85 byte that is invalid in the current encoding (i.e., the encoding it's using to decode the file).

My guess is that the file's encoding is Windows-1252, in which hex 85 denotes the ellipsis character.

So the solution for your parser might be as simple as changing 'utf-8' to 'cp1252' in the codecs.open call.

Michael Dyck
  • This actually gives me the following error: `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 60: character maps to <undefined>` – Tytire Recubans Nov 11 '19 at 13:51
  • @TytireRecubans if the goal is simply to read the file and not to get the proper characters, you can use `'latin-1'` for the encoding. It doesn't have any invalid bytes. – Mark Ransom Dec 11 '19 at 14:44
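
The difference between the two encodings mentioned in these comments can be checked directly. A minimal sketch, using a made-up two-byte sample rather than the real file: Python's `cp1252` codec leaves byte 0x81 unmapped, while `latin-1` maps every byte to a code point.

```python
# Made-up sample: 0x85 is defined in cp1252 (ellipsis), 0x81 is not.
raw = b"\x85 \x81"

try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    # Same failure as in the comment above: 0x81 has no mapping in cp1252.
    cp1252_ok = False

# latin-1 never fails, but 0x85 becomes the control character U+0085,
# not an ellipsis.
latin1_text = raw.decode("latin-1")

# A middle ground: keep cp1252 and substitute U+FFFD for unmapped bytes.
lenient_text = raw.decode("cp1252", errors="replace")

print(cp1252_ok, repr(latin1_text), repr(lenient_text))
```

Which of these is appropriate depends on whether the actual characters matter for the parser or only the field layout does.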
0

After going around for some time here and here I came up with this solution, which works.

with open(filepath, encoding='utf-8') as fh:
    for line in fh:
        byteline = bytearray(line, encoding='utf-8').replace(b'\xc2\x85', b'')
        line_clean = byteline.decode(encoding='utf-8')
        # do stuff with clean line.

The byte sequence that breaks the line is b'\xc2\x85', the UTF-8 encoding of U+0085 (byte 0x85 is the ellipsis character in Windows-1252).

First encode the string to an array of bytes with bytearray, then remove the sequence with the replace method of the bytearray class, and finally decode the cleaned bytes with the decode method, which returns the string without the weird character.
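
A possible explanation for why this works, offered as a sketch rather than a confirmed diagnosis: the readers in `codecs` split lines the way `str.splitlines()` does, which treats U+0085 as a line boundary, whereas the text-mode object returned by the built-in `open()` only breaks on `\n`, `\r` and `\r\n`. The data below is a made-up fragment of the record, and in-memory streams stand in for the real file.

```python
import codecs
import io

# Made-up fragment containing U+0085, encoded as UTF-8 (0xC2 0x85).
data = "MAYFIELD \x853<41 PLANTATION ROAD\n".encode("utf-8")

# A codecs stream reader (what codecs.open iterates with) splits at U+0085.
codecs_lines = list(codecs.getreader("utf-8")(io.BytesIO(data)))

# io.TextIOWrapper (what built-in open() returns in text mode) does not.
io_lines = list(io.TextIOWrapper(io.BytesIO(data), encoding="utf-8"))

print(len(codecs_lines), len(io_lines))  # 2 1

# Once the line arrives whole, the character can be dropped on the str itself,
# without the round trip through bytes.
clean = io_lines[0].replace("\x85", "")
print(clean)
```

Under that reading, switching from `codecs.open` to `open` is what stops the line from breaking, and the replace step only removes the stray character.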

Tytire Recubans
  • You realize that using `encode` and `decode` with the same encoding string does absolutely nothing, right? You end up with the exact same string you started with. – Mark Ransom Nov 12 '19 at 16:54