Resuming a nested for-loop

Question

Two files. One with broken data, the other with fixes. Broken:

ID 0
T5 rat cake
~EOR~
ID 1
T1 wrong segg
T2 wrong nacob
T4 rat tart
~EOR~
ID 3
T5 rat pudding
~EOR~
ID 4
T1 wrong sausag
T2 wrong mspa
T3 strawberry tart 
~EOR~
ID 6
T5 with some rat in it 
~EOR~

Fixes:

ID 1
T1 eggs
T2 bacon
~EOR~
ID 4
T1 sausage
T2 spam
T4 bereft of loif
~EOR~

EOR means end of record. Note that the Broken file has more records than the Fixes file, which contains both tags to fix and tags to add (T1, T2, etc. are tags). This code does exactly what it's supposed to do:

# foobar.py

import codecs

source = 'foo.dat'
target = 'bar.dat' 
result = 'result.dat'  

with codecs.open(source, 'r', 'utf-8_sig') as s, \
     codecs.open(target, 'r', 'utf-8_sig') as t, \
     codecs.open(result, 'w', 'utf-8_sig') as u: 

    sID = sT1 = sT2 = sT4 = ''
    RecordFound = False

    # get source data, record by record
    for sline in s:
        if sline.startswith('ID '):
            sID = sline
        if sline.startswith('T1 '):
            sT1 = sline
        if sline.startswith('T2 '):
            sT2 = sline
        if sline.startswith('T4 '):
            sT4 = sline
        if sline.startswith('~EOR~'):
            for tline in t: 
                # copy target file lines, replacing when necessary
                if tline == sID:
                    RecordFound = True
                if tline.startswith('T1 ') and RecordFound:
                    tline = sT1
                if tline.startswith('T2 ') and RecordFound:
                    tline = sT2 
                if tline.startswith('~EOR~') and RecordFound:
                    if sT4:
                        tline = sT4 + tline
                    RecordFound = False
                    u.write(tline)
                    break

                u.write(tline)

    for tline in t:
        u.write(tline)

I'm writing to a new file because I don't want to mess up the other two. The first outer for loop finishes on the last record in the fixes file. At that point, there are still records to write in the target file. That's what the last for-clause does.

What's nagging me is that this last for-clause implicitly picks up where the first inner for loop was last broken out of. It's as if it should say 'for the rest of tline in t'. On the other hand, I don't see how I could do this with fewer (or not many more) lines of code (using dicts and what have you). Should I worry at all?

Please comment.

I would create a counter "tPosition" that you increase each time you move through the relevant loop. Then, when you want to say "for the rest of tline in t" you can indicate that you want to loop over something like: for tline in t[tPosition:] — duhaime, Nov 28 '13 at 17:52

score 2 · Answer 1 · edited May 23 '17 at 12:19

I wouldn't worry. In your example, t is a file handle and you are iterating over it. File handles in Python are their own iterators; they have state information about where they've read in the file and will keep their place as you iterate over them. You can check the python docs for file.next() for more info.
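
The behaviour described above can be checked with a minimal sketch. It assumes Python 3 and uses `io.StringIO` as a stand-in for an open file; the sample lines and the names `first_pass` and `rest` are made up for illustration.

```python
import io

# Stand-in for an open file; a real file object iterates the same way.
t = io.StringIO("ID 0\n~EOR~\nID 1\n~EOR~\nID 3\n~EOR~\n")

first_pass = []
for line in t:
    first_pass.append(line)
    if line.startswith('~EOR~'):
        break  # leave the loop after the first record

# A second loop over the same object resumes where the first one stopped,
# because the object is its own iterator and remembers its position.
rest = [line for line in t]

print(first_pass)
print(rest)
```

Because `iter(t)` returns `t` itself, every `for` loop over the handle shares one position, which is why the final loop in the question writes only the remaining lines.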

See also another SO answer that also talks about iterators: What does the "yield" keyword do in Python? Lots of helpful information there!

Edit: Here's another way to combine them using dictionaries. This method may be desirable if you want to do other modifications to the records before you output:

import sys

def get_records(source_lines):
    records = {}
    current_id = None
    for line in source_lines:
        if line.startswith('~EOR~'):
            continue
        # Split the line up on the first space
        tag, val = [l.rstrip() for l in line.split(' ', 1)]
        if tag == 'ID':
            current_id = val
            records[current_id] = {}
        else:
            records[current_id][tag] = val
    return records

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        broken = get_records(f)
    with open(sys.argv[2]) as f:
        fixed = get_records(f)

    # Merge the broken and fixed records
    repaired = broken
    for id in fixed.keys():
        repaired[id] = dict(broken[id].items() + fixed[id].items())

    with open(sys.argv[3], 'w') as f:
        for id, tags in sorted(repaired.items()):
            f.write('ID {}\n'.format(id))
            for tag, val in sorted(tags.items()):
                f.write('{} {}\n'.format(tag, val))
            f.write('~EOR~\n')

The dict(broken[id].items() + fixed[id].items()) part takes advantage of this: How to merge two Python dictionaries in a single expression?
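
Note that adding two `items()` results only works on Python 2, where `items()` returns lists; on Python 3 it returns views and the `+` raises a `TypeError`. As a hedged sketch of the same merge on Python 3 (the variable names and sample tags below are illustrative, not taken from the answer's code):

```python
broken_tags = {'T1': 'wrong sausag', 'T2': 'wrong mspa', 'T3': 'strawberry tart'}
fixed_tags = {'T1': 'sausage', 'T2': 'spam', 'T4': 'bereft of loif'}

# Python 3.5+: unpack both dicts; on key collisions the later one wins.
merged = {**broken_tags, **fixed_tags}

# Equivalent form that also runs on Python 2: copy first, then update.
merged_via_update = dict(broken_tags)
merged_via_update.update(fixed_tags)

print(merged)
```

Both forms leave `broken_tags` untouched, since the merge happens on a new dict.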

Thanks! The link to file.next() has the confirmation that I was seeking. I had already come across the yield explanation. — RolfBly, Nov 28 '13 at 20:21
It's not good to just skip `~EOR~`s. If there is no 'ID' after `~EOR~` line, you will corrupt data. In such situations you need to `raise`. — akaRem, Nov 29 '13 at 11:26
In line `repaired = broken`, `repaired` - is just an alias for `broken` (it is **the same dict**), so you mutate original data. Such code-style always brings bugs in future development. You need deep copy of `broken`. Or you **must rename** these variables. — akaRem, Nov 29 '13 at 11:31
Also, `id` is a built-in and you shadow it. `tag, val = [l.rstrip() for l in line.split(' ', 1)]` will raise on lines that remove data: `'T1 '` -> raise — akaRem, Nov 29 '13 at 11:49
Thank you for your comments @akaRem. This was meant as a toy example showing some Python syntax and semantics that the OP might not have been aware of. I am assuming all inputs are well-formed and avoided including error checks. — crennie, Nov 29 '13 at 17:25
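
The aliasing point raised in the comments can be illustrated with a small sketch. The names `alias` and `independent` and the sample record are hypothetical; `copy.deepcopy` is one way to get a separate copy, not necessarily what either author would choose.

```python
import copy

broken = {'4': {'T1': 'wrong sausag'}}

alias = broken                        # same dict object under a second name
independent = copy.deepcopy(broken)   # separate outer and inner dicts

alias['4']['T1'] = 'sausage'          # mutates broken as well

print(broken['4']['T1'])
print(independent['4']['T1'])
```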

akaRem · Answer 2 · 2014-02-06T21:06:15.173

1

# building initial storage

content = {}
record = {}
order = []
current = None

with open('broken.file', 'r') as f:
    for line in f:
        # strip the line ending so keys and values compare cleanly
        items = line.rstrip('\r\n').split(' ', 1)
        try:
            key, value = items
        except ValueError:
            # a line without a space, e.g. '~EOR~'
            key, = items
            value = None

        if key == 'ID':
            current = value
            order.append(current)
            content[current] = record = {}
        elif key == '~EOR~':
            current = None
            record = {}
        else:
            record[key] = value

# patching

with open('patches.file', 'r') as f:
    for line in f:
        items = line.rstrip('\r\n').split(' ', 1)
        try:
            key, value = items
        except ValueError:
            key, = items
            value = None

        if key == 'ID':
            current = value
            record = content[current]  # updates existing records only!
            # if there is no such id -> raises

            # alternatively you may check and add them to the end of list
            # if current in content: 
            #     record = content[current]
            # else:
            #     order.append(current)
            #     content[current] = record = {}

        elif key == '~EOR~':
            current = None
            record = {}
        else:
            record[key] = value

# patched!
# write-out

with open('output.file', 'w') as out:
    for current in order:
        out.write('ID ' + current + '\n')
        record = content[current]
        for key in sorted(record.keys()):
            out.write(key + ' ' + (record[key] or '') + '\n')
        # close each record with its end-of-record marker
        out.write('~EOR~\n')

# job's done

questions?

edited Feb 06 '14 at 21:06

answered Nov 28 '13 at 18:51

akaRem


Thanks. I like your approach for handling the records. I guess it's indeed more Pythonic than mine. (I find Pythonicness a rather difficult subject. So much in there you can use, but can't find on your own). Your code crashes on EOR lines with 'need more than 1 value to unpack' and I guess `curent` should be `current`, but that's not important. No need for any discussion there. – RolfBly Nov 29 '13 at 07:47
@RolfBly I wrote this on PC without python installed, so.. I didn't test it. Sorry for mistakes. I'll fix them. – akaRem Nov 29 '13 at 11:12
@RolfBly I've added fixes – akaRem Nov 29 '13 at 11:44
It took a while before I had to try your code. The first `record[key] = value` results in `TypeError: 'NoneType' object does not support item assignment`. The output file remains empty, of course. – RolfBly Feb 03 '14 at 15:24
Also, the second for loop doesn't do any patching. – RolfBly Feb 03 '14 at 16:43
@RolfBly I've added some edits. Maybe this will work. Sorry for the errors. The main idea of my listing is the __logic__ for how to implement your idea. My implementation may have some mistakes, because I didn't test it. – akaRem Feb 06 '14 at 21:06
In the meantime, I've made something that's partly based on your approach. So thank you for the hints! My solution is below – RolfBly Feb 07 '14 at 21:19

score 0 · Answer 3 · answered Feb 07 '14 at 22:06

For the sake of completeness, and just to share my enthusiasm and what I learned, below is the code that I now work with. It answers my OP, and more.

It's based in part on akaRem's approach above. A single function fills a dict. It's called twice, once for the fixes file, once for the file-to-fix.

import codecs, collections
from GetInfiles import *

sourcefile, targetfile = GetInfiles('dat')
    # GetInfiles reads two input parameters from the command line,
    # verifies they exist as files with the right extension, 
    # and then returns their names. Code not included here. 

resultfile = targetfile[:-4] + '_result.dat'  

def recordlist(infile):
    record = collections.OrderedDict()
    reclist = []

    with codecs.open(infile, 'r', 'utf-8_sig') as f:
        for line in f:
            try:
                key, value = line.split(' ', 1)

            except ValueError:
                key = line 
                # so this line must be '~EOR~\n'. 
                # All other lines must have the shape 'tag content\n'
                # so if this errors, there's something wrong with an input file

            if not key.startswith('~EOR~'):
                try: 
                    record[key].append(value)
                except KeyError:
                    record[key] = [value]

            else:
                reclist.append(record)
                record = collections.OrderedDict()

    return reclist

# put files into ordered dicts            
source = recordlist(sourcefile)
target = recordlist(targetfile)

# patching         
for fix in source:
    for record in target:
        if fix['ID'] == record['ID']:
            record.update(fix)

# write-out            
with codecs.open(resultfile, 'w', 'utf-8_sig') as f:
    for record in target:
        for tag, field in record.iteritems():
            for occ in field: 
                line = u'{} {}'.format(tag, occ)
                f.write(line)

        f.write('~EOR~\n')

It's now an ordered dict. This was not in my OP, but the files need to be cross-checked by humans, so keeping the order makes that easier. (Using OrderedDict is really easy. My first attempts at finding this functionality led me to odict, but its documentation worried me. No examples, intimidating jargon...)

Also, it now supports multiple occurrences of any given tag inside a record. This was not in my OP either, but I needed this. (That format is called 'Adlib tagged'; it's cataloguing software.)
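
As a hedged sketch of how repeated tags end up in one list: `setdefault` is used here as an alternative spelling of the try/except KeyError in the code above, and the sample record is made up.

```python
from collections import OrderedDict

# Hypothetical record with tag T1 occurring twice.
lines = ['ID 7\n', 'T1 eggs\n', 'T1 bacon\n', 'T2 spam\n']

record = OrderedDict()
for line in lines:
    key, value = line.split(' ', 1)
    # Create the list on first sight of a tag, then append to it.
    record.setdefault(key, []).append(value)

print(record['T1'])
```

Each tag maps to a list of its occurrences in file order, and the tags themselves keep the order in which they first appeared.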

Different from akaRem's approach is the patching, using update for the target dict. I find this, as often with Python, really and truly elegant. Likewise for startswith. These are two more reasons I can't resist sharing it.
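
To make the effect of `update` concrete, here is a small sketch using record 4 from the sample data in the question. It assumes each tag maps to a list of values, as in the code above.

```python
from collections import OrderedDict

record = OrderedDict([('ID', ['4\n']), ('T1', ['wrong sausag\n']),
                      ('T2', ['wrong mspa\n']), ('T3', ['strawberry tart\n'])])
fix = OrderedDict([('ID', ['4\n']), ('T1', ['sausage\n']),
                   ('T2', ['spam\n']), ('T4', ['bereft of loif\n'])])

if fix['ID'] == record['ID']:
    # Existing tags are overwritten in place; new tags are appended at the end.
    record.update(fix)

print(list(record.keys()))
```

One consequence worth keeping in mind: `update` replaces the whole list for a tag, so a fix that lists a tag once overwrites every occurrence of that tag in the target record.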

I hope it's useful.
