
I have a fixed-width file that I'm processing in Python, but loading it fails with an error because the byte \x9f cannot be decoded. Forcing the file to load as latin-1 fixes that, but when I then try to replace the \x9f value, it doesn't work unless I write the output to another file, which doesn't seem efficient.
Can anyone advise a better way of doing this, please?

import pprint
import re, collections
import platform


#### INPUTS ####
layout = [
        ('ID', 0, 11),
        ('FIN-STATEMENT-IND',   83  ,   84  )   ,
        ('RECENT-FIN-STAT-AGE', 84  ,   86  )   ,
        ('FAILED-TO-FILE-IND',  86  ,   87  )   ,
        ('FIN-STAT-OVDUE-IND',  87  ,   88  )   ,
        ('NET-WORTH'    , 88,   99  )   ,
        ]

headerdict = {}

#### OPEN ####
with open('uk_dcl_mrg.txt', 'r+', encoding='latin-1') as f:
    for line in f:
        f.write(line.replace('\x9f', '?'))


    ct = 0
    for line in f:
        ct += 1

        #### OUTPUT ####
        for i in layout:  ## Loop to create dictionary
            headerdict[i[0]] = line[i[1]:i[2]]


        print ('Sort by keys:')
        for key in sorted(headerdict.keys()):
            print ("%s: %s" % (key, headerdict[key]))    
        print(headerdict)
        # print(platform.python_version())
        if ct >= 1:
            break

If I add the line below, so that I write to a second file and then build the dictionary from that file, it works fine, but I don't want to create a second file.

with open('uk_dcl_mrg_out.txt', 'r+', encoding='latin-1') as fo:
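
For comparison, here is a minimal sketch of doing the replacement in memory while parsing, so that no second file is written and the original file is left untouched. The helper names `parse_line` and `read_records`, the shortened `layout`, and the sample record are all made up for illustration; only the `latin-1` encoding, the `'\x9f'` to `'?'` replacement, and the slice positions come from the code above.

```python
import os
import tempfile

# Hypothetical subset of the layout above: (field name, start, end).
layout = [
    ('ID', 0, 11),
    ('NET-WORTH', 88, 99),
]


def parse_line(line, layout):
    """Replace the stray character in memory, then slice the fixed-width fields."""
    cleaned = line.replace('\x9f', '?')
    return {name: cleaned[start:end] for name, start, end in layout}


def read_records(path, layout, limit=None):
    """Yield one dict per line. The file is opened read-only and never modified."""
    with open(path, 'r', encoding='latin-1') as f:
        for count, line in enumerate(f, start=1):
            yield parse_line(line, layout)
            if limit is not None and count >= limit:
                break


# Demo on an invented record: 11-character ID, 77 spaces of padding,
# then an 11-character NET-WORTH field that starts with the \x9f character.
sample = '12345678901' + ' ' * 77 + '\x9f2345678901\n'
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='latin-1', newline='') as f:
    f.write(sample)

records = list(read_records(demo_path, layout, limit=1))
os.remove(demo_path)
print(records[0])
```

Because the replacement happens on each decoded line, the file position is never disturbed by a write, which sidesteps the read/write interleaving in the original loop.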
Leigh
  • I had the same problem when I switched from 2.7 to 3.x. I was not able to find a solution other than specifying latin-1 as format. – tnknepp Dec 21 '16 at 14:42
  • `\x9f` may be part of a 2-byte character. When you decode as `latin-1`, every byte is mapped to a single Unicode character with its own code point, so you can then save the text in a different encoding, e.g. `utf-8`. If you are sure that replacing `\x9f` with `?` won't corrupt the file, you can always open it in binary mode (`br+`) and work with the individual bytes. – furas Dec 21 '16 at 14:52
  • http://stackoverflow.com/questions/19056863/decode-an-encoded-unicode-string-in-python – furas Dec 21 '16 at 14:59
  • http://effbot.org/zone/unicode-gremlins.htm : `u"\x9F": u"\u0178", # LATIN CAPITAL LETTER Y WITH DIAERESIS` – furas Dec 21 '16 at 15:01
  • I get all of the above, but what I don't understand is why writing the output back to the same file doesn't work (see main code), whereas adding the code to write to a separate file works fine. Surely it shouldn't matter where you write the output to? – Leigh Dec 22 '16 at 09:55
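
The binary-mode suggestion in the comments can be sketched as follows. A likely reason the original loop misbehaves is that `f.write` is called while the same handle is being iterated, so the write does not land on the line that was just read, and the second `for line in f` starts at the end of the file unless `f.seek(0)` is called first. The sketch below avoids both issues by tracking offsets explicitly. The function name `replace_byte_in_place`, its parameters, and the demo bytes are hypothetical; the approach is only safe because one byte is swapped for exactly one byte, so no offsets shift.

```python
import os
import tempfile


def replace_byte_in_place(path, old=b'\x9f', new=b'?', chunk_size=64 * 1024):
    """Overwrite every `old` byte with `new` inside the file itself.

    Returns the number of bytes replaced. Both values must be one byte long
    so the file length and all field offsets stay the same.
    """
    if len(old) != 1 or len(new) != 1:
        raise ValueError('old and new must each be exactly one byte')
    replaced = 0
    with open(path, 'rb+') as f:
        while True:
            start = f.tell()           # remember where this chunk begins
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hits = chunk.count(old)
            if hits:
                f.seek(start)          # go back to the start of the chunk
                f.write(chunk.replace(old, new))
                f.seek(start + len(chunk))  # resume reading after the chunk
                replaced += hits
    return replaced


# Demo on invented bytes written to a temporary file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'AB\x9fCD\nEF\x9f\n')

count = replace_byte_in_place(demo_path, chunk_size=3)
with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)
print(count, result)
```

After this runs, the file can be opened as text and parsed in a separate pass. Note that it modifies the original data, so it is worth keeping a backup if the `\x9f` bytes might carry meaning.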

0 Answers