I have two files: one with broken data, the other with fixes. Broken:
ID 0
T5 rat cake
~EOR~
ID 1
T1 wrong segg
T2 wrong nacob
T4 rat tart
~EOR~
ID 3
T5 rat pudding
~EOR~
ID 4
T1 wrong sausag
T2 wrong mspa
T3 strawberry tart
~EOR~
ID 6
T5 with some rat in it
~EOR~
Fixes:
ID 1
T1 eggs
T2 bacon
~EOR~
ID 4
T1 sausage
T2 spam
T4 bereft of loif
~EOR~
EOR means end of record, and T1, T2, etc. are tags. Note that the broken file has more records than the fixes file, which contains both tags to fix and tags to add. In the code, `source` is the fixes file and `target` is the broken one. This code does exactly what it's supposed to do:
# foobar.py
import codecs

source = 'foo.dat'      # fixes file
target = 'bar.dat'      # broken file
result = 'result.dat'   # merged output

with codecs.open(source, 'r', 'utf-8_sig') as s, \
     codecs.open(target, 'r', 'utf-8_sig') as t, \
     codecs.open(result, 'w', 'utf-8_sig') as u:
    sID = sT1 = sT2 = sT4 = ''
    RecordFound = False
    # get source data, record by record
    for sline in s:
        if sline.startswith('ID '):
            sID = sline
        if sline.startswith('T1 '):
            sT1 = sline
        if sline.startswith('T2 '):
            sT2 = sline
        if sline.startswith('T4 '):
            sT4 = sline
        if sline.startswith('~EOR~'):
            # one complete fix record has been read; apply it to the target
            for tline in t:
                # copy target file lines, replacing when necessary
                if tline == sID:
                    RecordFound = True
                if tline.startswith('T1 ') and RecordFound:
                    tline = sT1
                if tline.startswith('T2 ') and RecordFound:
                    tline = sT2
                if tline.startswith('~EOR~') and RecordFound:
                    if sT4:
                        # add the new T4 tag just before the end-of-record marker
                        tline = sT4 + tline
                    RecordFound = False
                    u.write(tline)
                    break
                u.write(tline)
    # copy whatever is left in the target file after the last fix
    for tline in t:
        u.write(tline)
I'm writing to a new file because I don't want to mess up the other two. The outer for loop finishes after the last record in the fixes file. At that point there are still records left to copy from the target file, which is what the final for loop does.
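To check that the final loop really only sees the unread remainder, I tried the same pattern in isolation. This is a minimal sketch that uses `io.StringIO` as a stand-in for the target file; the five-line content is made up for illustration.

```python
import io

# Stand-in for the target file: a file-like object with five lines.
t = io.StringIO("a\nb\nc\nd\ne\n")

first = []
for line in t:
    first.append(line.strip())
    if line.startswith("b"):
        break  # stop early, like the inner loop does at ~EOR~

# A second loop over the same object continues after the break point.
rest = [line.strip() for line in t]

print(first)         # ['a', 'b']
print(rest)          # ['c', 'd', 'e']
print(iter(t) is t)  # True: the file object is its own iterator
```

Since the file object is its own iterator, each `for` statement just keeps pulling lines from the current position instead of starting over.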
What's nagging me is that this last loop implicitly picks up where the inner for loop was last broken out of. It's as if it should say 'for the rest of tline in t'. On the other hand, I don't see how I could do this with fewer (or not many more) lines of code, using dicts and what have you. Should I worry at all?
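For comparison, this is roughly what I imagine a dict-based version would look like. It is only a sketch: `parse_records` and `merge` are names I made up, it assumes the whole fixes file fits in memory, it writes the tags of each record sorted by tag name (which would misorder names like T10), and unlike my code above it applies every tag found in a fix record, not just T1, T2 and T4.

```python
def parse_records(lines):
    """Yield (id_line, {tag: line}) for every record terminated by ~EOR~."""
    id_line, tags = None, {}
    for line in lines:
        if line.startswith('ID '):
            id_line = line
        elif line.startswith('~EOR~'):
            yield id_line, tags
            id_line, tags = None, {}
        else:
            tags[line.split(' ', 1)[0]] = line  # key is the tag, e.g. 'T1'


def merge(broken_lines, fix_lines):
    """Return the broken records with fixed and added tags applied."""
    fixes = dict(parse_records(fix_lines))       # id_line -> {tag: line}
    out = []
    for id_line, tags in parse_records(broken_lines):
        tags.update(fixes.get(id_line, {}))      # replace or add tags
        out.append(id_line)
        out.extend(tags[tag] for tag in sorted(tags))
        out.append('~EOR~\n')
    return out


broken = ['ID 1\n', 'T1 wrong segg\n', 'T2 wrong nacob\n', 'T4 rat tart\n', '~EOR~\n',
          'ID 3\n', 'T5 rat pudding\n', '~EOR~\n']
fixes = ['ID 1\n', 'T1 eggs\n', 'T2 bacon\n', '~EOR~\n']
print(''.join(merge(broken, fixes)), end='')
```

Both arguments can be open file objects as well as lists, so the broken file is still read line by line. It avoids the implicit resume, but it is not shorter than what I have.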
Please comment.