I have a file in an old format from the 70s used by Companies House (the UK company registry).
I inherited a parser written 6 years ago which goes line by line and, according to a set of conditions, extracts the information from each line and inserts it into a dictionary.
There is a weird character that is breaking a line.
I copied this line to a new file with
awk '{if(NR==33411) print $0}' PROD216_1950_ew_1.dat > broken
and opened broken in vim.
It turns out that the weird character is displayed by vim as <85>.
The result is that everything after MAYFIELD is read as a new line.
Below is the line in question:
000376702103032986930001 1993010119941024 193709 0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD 3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<
in vim becomes
000376702103032986930001 1993010119941024 193709 0105<BARRY ALEXANDER<GROSVENOR<<<<MAYFIELD <85>3<41 PLANTATION ROAD<THE PEAK<<HONG KONG<BANK EXECUTIVE<BRITISH<<
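To see what vim's <85> corresponds to on disk, the raw bytes of the line can be inspected directly. The following is a minimal sketch, not part of the parser: `show_non_ascii` is a hypothetical helper, and the sample bytes are an assumption (they suppose the character is stored as the two-byte UTF-8 sequence C2 85; the real file may differ, so the actual bytes should be read from `broken` in binary mode).

```python
def show_non_ascii(raw: bytes):
    """Return (offset, hex value) for every non-ASCII byte in raw."""
    return [(i, format(b, "02x")) for i, b in enumerate(raw) if b > 0x7F]


# Assumed sample: a fragment of the record with the character encoded as C2 85.
# For the real data, use: raw = open("broken", "rb").read()
sample = b"<<<<MAYFIELD \xc2\x853<41 PLANTATION ROAD<"
print(show_non_ascii(sample))  # [(13, 'c2'), (14, '85')]
```

If the output shows `c2 85`, the character is U+0085 (NEXT LINE, a C1 control character); a lone `85` byte would instead point at a single-byte encoding rather than UTF-8.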
I am using codecs to read this file with a context manager, which I thought was the right way of going about it.
Is there anything I am missing? What is that <85>?
import codecs

# filepath and determine_line_type are defined elsewhere in the parser
with codecs.open(filepath, 'r', 'utf-8') as fh:
    for line in fh:
        linetype = determine_line_type(line)
        if linetype == 'header':
            continue
        elif linetype == 'company':
            pass  # do stuff...
        elif linetype == 'officer':
            pass  # do stuff...
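One way to check whether the reader itself is doing the splitting is to write a small test record to a temporary file and count the lines each reader yields. This is only a sketch under assumptions: the record below is a shortened stand-in for the real line, it assumes the character is U+0085, and `count_lines_both_ways` is a hypothetical helper. It compares iteration over `codecs.open` with the built-in `open` using `newline="\n"`.

```python
import codecs
import os
import tempfile

# Shortened stand-in for the real record, with an assumed U+0085 after MAYFIELD
record = "GROSVENOR<<<<MAYFIELD \u00853<41 PLANTATION ROAD<\n"


def count_lines_both_ways(text):
    """Write text to a temp file and count lines via codecs.open and open."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as fh:
            fh.write(text)
        # codecs stream readers split lines with str.splitlines(),
        # which treats U+0085 as a line boundary
        with codecs.open(path, "r", "utf-8") as fh:
            via_codecs = [line for line in fh]
        # the built-in open with newline="\n" only splits on "\n"
        with open(path, "r", encoding="utf-8", newline="\n") as fh:
            via_open = [line for line in fh]
    finally:
        os.remove(path)
    return len(via_codecs), len(via_open)


print(count_lines_both_ways(record))  # (2, 1)
```

If the real file behaves the same way, the extra line break comes from how `codecs` iterates over lines rather than from the data containing an actual newline.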