How can I elegantly combine/concat files by section with python?

Question

Like many an unfortunate programmer soul before me, I am currently dealing with an archaic file format that refuses to die. I'm talking ~1970 format specification archaic. If it were solely up to me, we would throw out both the file format and any tool that ever knew how to handle it, and start from scratch. I can dream, but unfortunately that won't resolve my issue.

The format: Pretty loosely defined, as years of nonsensical revisions have destroyed almost all the backward compatibility it once had. Basically, the only constant is that there are section headings, with few rules about what comes before or after these lines. The headings appear in a fixed order (e.g. HEADING1, HEADING2, HEADING3, ...), but they are not numbered and none of them is required (e.g. HEADING1, HEADING3, HEADING7). Thankfully, all possible heading permutations are known. Here's a fake example:

# Bunch of comments

SHOES # First heading
# bunch of text and numbers here

HATS # Second heading
# bunch of text here

SUNGLASSES # Third heading
...

My problem: I need to concatenate multiple of these files by these section headings. I have a perl script that does this quite nicely:

while(my $l=<>) {

    if($l=~/^SHOES/i) { $r=\$shoes; name($r);}
    elsif($l=~/^HATS/i) { $r=\$hats; name($r);}
    elsif($l=~/^SUNGLASSES/i) { $r=\$sung; name($r);}
    elsif($l=~/^DRESS/i || $l=~/^SKIRT/i ) { $r=\$dress; name($r);}
    ...
    ...
    elsif($l=~/^END/i) { $r=\$end; name($r);}
    else {
        $$r .= $l;
    }
    print STDERR "Finished processing $ARGV\n" if eof;
}

As you can see, with the perl script I basically just change where a reference points to when I get to a certain pattern match, and concatenate each line of the file to its respective string until I get to the next pattern match. These are then printed out later as one big concatenated file.

I would and could stick with perl, but my needs are becoming more complex every day and I would really like to see how this problem can be solved elegantly with python (can it?). As of right now my method in python is basically to load the entire file as a string, search for the heading locations, then split up the string based on the heading indices and concat the strings. This requires a lot of regex, if-statements and variables for something that seems so simple in another language.

It seems that this really boils down to a fundamental language issue. I found a very nice SO discussion, "How do I pass a variable by reference?", about python's "call-by-object" style as compared with that of other languages that are call-by-reference. Yet, I still can't think of an elegant way to do this in python. If anyone can help kick my brain in the right direction, it would be greatly appreciated.

What makes you think you need call-by-reference here? Nothing in your description seems to imply it would be useful. If you showed us the code, we could show you how to do it (or maybe offer a better solution at a higher level), but in the abstract, we can't really tell you anything other than the link you already found. — abarnert, Feb 18 '13 at 23:29
The natural question in my mind would be "How do I rewrite this elegantly in perl?" What makes you think python is better suited for complex tasks than perl? — TLP, Feb 18 '13 at 23:38
@TLP: There are plenty of good reasons why it might be worth porting this. Maybe the OP is much more comfortable with Python than with Perl, or he's working on a team with a lot more Python skills, or… But you're right, without some such reason, porting just for the sake of porting is pointless. — abarnert, Feb 19 '13 at 00:38

ikegami · Answer 1 · 2013-02-18T23:54:12.990

2

That's not even elegant Perl.

my @headers = qw( shoes hats sunglasses dress );

my $header_pat = join "|", map quotemeta, @headers;
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if    (/($header_re)/) { name( $section = \$sections{lc $1   } ); }  # lc: the match is case-insensitive
    elsif (/skirt/i)       { name( $section = \$sections{'dress'} ); }
    else { $$section .= $_; }

    print STDERR "Finished processing $ARGV\n" if eof;
}

Or if you have many exceptions:

my @headers = qw( shoes hats sunglasses dress );
my %aliases = ( 'skirt' => 'dress' );

my $header_pat = join "|", map quotemeta, @headers, keys(%aliases);
my $header_re = qr/$header_pat/i;

my ( $section, %sections );
while (<>) {
    if (/($header_re)/) {
       name( $section = \$sections{ $aliases{lc $1} // lc $1 } );  # lc: alias keys are lowercase
    } else {
       $$section .= $_;
    }

    print STDERR "Finished processing $ARGV\n" if eof;
}

Using a hash saves the countless my declarations you didn't show.

You could also do $header_name = $1; name(\$sections{$header_name}); and $sections{$header_name} .= $_ for a bit more readability.

edited Feb 18 '13 at 23:54

answered Feb 18 '13 at 23:47

ikegami

My `perl` is pretty rusty, but doesn't just using old-school perl without strict also let him skip the countless my declarations (at the cost of less readable and less robust code, but we already know that the perl is legacy code he doesn't want to maintain, so that's not too implausible)? – abarnert Feb 18 '13 at 23:55
Who said it would be, or is, elegant code? It's _legacy code that he doesn't want to maintain_. – abarnert Feb 19 '13 at 00:00
@abarnert Perhaps the point is that you save a bunch of scalar variables (`$shoes`, `$hats`, `$dress` etc). – TLP Feb 19 '13 at 00:14
@TLP: I think that's a much better argument for this answer. Using a `hash` is better because it's inherently the right way to handle problems like this—there's just no reason to have those separate variables in the first place. It's not a coincidence that all of the Python solutions suggested by 4 different people used a `dict`. And, even though in perl TMTOOWTDI instead of TOOWTDI, sometimes one way is just obviously right anyway… – abarnert Feb 19 '13 at 00:36
@abarnert, Did you actually read that to which you replied? The OP said his code was elegant. I posted that it wasn't particularly so and showed him what elegant code would look like. And then you say my demonstration of elegant code shouldn't have been strict-safe? Enough of this nonsense! – ikegami Feb 19 '13 at 00:50
@ikegami. The OP did _not_ say his perl code was elegant; he said that he can't think of a way to write the python code elegantly. And I did _not_ say your code shouldn't have been strict-safe, I said that his existing code probably isn't strict-safe. I am perfectly capable of reading what's written, but I am not able to read things that don't exist, and I'm sorry that you expect me to. – abarnert Feb 19 '13 at 00:59
@abarnert, He did ("does this quite nicely"), but it doesn't even matter. MY CODE IS A DEMONSTRATION OF WHAT ELEGANT CODE WOULD BE. Stop this nonsense! It would NOT be appropriate to use undeclared globals. It would not be appropriate to change to the different language Perl is without `use strict;`. It would not be appropriate to remove the checks `use strict;` adds. It is incredible that you would think that, much less mention it! – ikegami Feb 19 '13 at 01:11
@ikegami: You don't seem to understand what I wrote. Once again, I am not suggesting that you should use undeclared globals, or that he should, but that his existing legacy code likely _does_. In other words, even though there _should be_ countless my declarations in his code, there probably aren't, and therefore offering to remove them is not the right argument. Meanwhile, I'm not sure why you're getting so defensive that you need to shout in all caps. I already gave you a +1, and explained why, but if that makes you angry, I can remove my upvote. – abarnert Feb 19 '13 at 01:29

abarnert · Answer 2 · 2013-02-18T23:48:11.220

I'm not sure if I understand your whole problem, but this seems to do everything you need:

import sys

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = [[] for header in headers]

for arg in sys.argv[1:]:
    section_index = 0
    with open(arg) as f:
        for line in f:
            # Headings may be skipped, so check every heading after the current one.
            for i in range(section_index + 1, len(headers)):
                if line.startswith(headers[i]):
                    section_index = i
                    break
            else:
                sections[section_index].append(line)

Obviously you could change this to read or mmap the whole file, then re.search or just buf.find for the next header. Something like this (untested pseudocode):

import sys
from collections import defaultdict

headers = [None, 'SHOES', 'HATS', 'SUNGLASSES']
sections = defaultdict(list)

for arg in sys.argv[1:]:
    with open(arg) as f:
        buf = '\n' + f.read()  # leading newline so a heading on the first line is found
    section = None
    start = 0
    for header in headers[1:]:
        idx = buf.find('\n'+header, start)
        if idx != -1:
            sections[section].append(buf[start:idx])
            section = header
            start = buf.find('\n', idx+1)
            if start == -1:
                break
    else:
        sections[section].append(buf[start:])

And there are plenty of other alternatives, too.

But the point is, I can't see anywhere where you'd need to pass a variable by reference in any of those solutions, so I'm not sure where you're stumbling on whichever one you've chosen.
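For what it's worth, the closest Python analogue to re-pointing `$r` in the Perl loop is probably just rebinding a name to a different list; appending through that name mutates the list stored in the dict. A minimal sketch, where `HEADERS`, `split_sections`, and the sample lines are all made up for illustration, and where a heading is assumed to be the first word on its line (slightly stricter than the Perl prefix match):

```python
from collections import defaultdict

# Hypothetical heading names, taken from the fake example in the question.
HEADERS = ('SHOES', 'HATS', 'SUNGLASSES')

def split_sections(lines, sections=None):
    """Append each line to the list for the section it falls under."""
    if sections is None:
        sections = defaultdict(list)
    current = sections[None]           # lines before the first heading
    for line in lines:
        words = line.split()
        key = words[0].upper() if words else ''
        if key in HEADERS:
            current = sections[key]    # rebind, like $r = \$shoes in the Perl
        else:
            current.append(line)       # mutates the list held by the dict
    return sections

file_a = ['# comments\n', 'SHOES # First heading\n', 'boots 2\n', 'HATS\n', 'fedora 1\n']
file_b = ['SHOES\n', 'sandals 4\n']
combined = split_sections(file_a)
split_sections(file_b, combined)       # the second file lands in the same dict
```

Passing the same dict back in for each file is what does the concatenation by section; no pass-by-reference trick is involved.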

So, what if you want to treat two different headings as the same section?

Easy: create a dict mapping headers to sections. For example, for the second version:

headers_to_sections = {None: None, 'SHOES': 'SHOES', 'HATS': 'HATS',
                       'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}

Now, in the code that does sections[section], just do sections[headers_to_sections[section]].

For the first, just make this a mapping from strings to indices instead of strings to strings, or replace sections with a dict. Or just flatten the two collections by using a collections.OrderedDict.
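The "replace sections with a dict" variant might look roughly like this. It is only a sketch: `HEADER_TO_SECTION`, `SECTION_ORDER`, `add_file`, and `render` are names invented here, and it assumes a heading is the first word on its line.

```python
# Hypothetical mapping from heading word to the section it belongs to.
HEADER_TO_SECTION = {'SHOES': 'SHOES', 'HATS': 'HATS',
                     'DRESSES': 'DRESSES', 'SKIRTS': 'DRESSES'}
# Assumed output order; None holds anything before the first heading.
SECTION_ORDER = [None, 'SHOES', 'HATS', 'DRESSES']

def add_file(lines, sections):
    """File each line under the section named by the most recent heading."""
    section = None
    for line in lines:
        first_word = line.split(None, 1)[0] if line.strip() else ''
        if first_word in HEADER_TO_SECTION:
            section = HEADER_TO_SECTION[first_word]
        else:
            sections[section].append(line)

def render(sections):
    """Join the sections in SECTION_ORDER, writing each heading once."""
    out = []
    for name in SECTION_ORDER:
        if name is not None and sections[name]:
            out.append(name + '\n')
        out.extend(sections[name])
    return ''.join(out)

sections = {name: [] for name in SECTION_ORDER}
add_file(['SHOES\n', 'boots\n', 'SKIRTS\n', 'mini\n'], sections)
add_file(['# note\n', 'DRESSES\n', 'gown\n'], sections)
combined = render(sections)
```

Here the SKIRTS lines from the first input and the DRESSES lines from the second end up under a single DRESSES heading in `combined`.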

@ikegami: I'm sorry, but in what way does it not handle SKIRT properly? Obviously you have to put `'SKIRT'` into `headers` if you want it to search for that. If your point is that you want to merge two headers into one section, I can show how to do that with both versions. — abarnert, Feb 18 '13 at 23:44

isedev · Answer 3 · 2013-02-18T23:55:43.003

0

Assuming you're reading from stdin, as in the perl script, this should do it:

import sys
import collections
headings = {'SHOES':'SHOES','HATS':'HATS','DRESS':'DRESS','SKIRT':'DRESS'} # etc...
sections = collections.defaultdict(str)
key = None
for line in sys.stdin:
    sline = line.strip()
    if sline not in headings:
        sections[headings.get(key)] += line  # values are strings, so concatenate
    else:
        key = sline

You'll end up with a dictionary like this:

{
    None: <all lines as a single string before any heading>,
    'HATS': <all lines as a single string below HATS heading and before next heading>,
    etc...
}

The headings dict does not have to be defined in the same order as the headings appear in the input.
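If you want the same behaviour as Perl's `<>` (read every path given on the command line, or stdin when there are none), the standard-library `fileinput` module is one option. A rough sketch; `HEADINGS` and `collect` are illustrative names, it assumes a heading sits alone on its line, and resetting `key` at the start of each file is a choice made here that the Perl loop does not make:

```python
import fileinput
from collections import defaultdict

# Hypothetical heading table; SKIRT is folded into DRESS.
HEADINGS = {'SHOES': 'SHOES', 'HATS': 'HATS', 'DRESS': 'DRESS', 'SKIRT': 'DRESS'}

def collect(files=None):
    """Group lines by section; files=None means sys.argv[1:] or stdin."""
    sections = defaultdict(str)
    key = None
    with fileinput.input(files=files) as stream:
        for line in stream:
            if stream.isfirstline():
                key = None             # each file starts outside any section
            name = line.strip()
            if name in HEADINGS:
                key = HEADINGS[name]
            else:
                sections[key] += line
    return sections
```

Calling `collect()` with no arguments picks up the paths from the command line; passing an explicit list of paths is handy for testing.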

edited Feb 18 '13 at 23:55

answered Feb 18 '13 at 23:37

isedev

Doesn't handle SKIRT properly. – ikegami Feb 18 '13 at 23:43
But the perl script reads from all paths passed into `ARGV`, and only reads `stdin` if you don't pass any paths. (And if you fix that, your script is pretty much identical to my first version.) – abarnert Feb 18 '13 at 23:43
@abarnert except ikegami is right... neither your solution nor mine handles SKIRT properly – isedev Feb 18 '13 at 23:44
@isedev: If I understand what he's saying, that's an incredibly trivial fix, which I already added to my answer. (And I still don't see why he thinks he needs pass-by-reference, mountains of if statements, etc. to do it. It's just a `dict`.) – abarnert Feb 18 '13 at 23:49
same here... argh, you keep beating me to it by 45 seconds :) – isedev Feb 18 '13 at 23:50
@isedev: PS, you should never use `readlines`. Why not just iterate `for line in sys.stdin`, which has the exact same effect but without generating a giant `list`? If you _do_ need the `list`, just do `list(sys.stdin)`, which is simpler and doesn't require a deprecated method. – abarnert Feb 18 '13 at 23:50
@isedev: Actually, I meant to write "you should never use `readlines()`"; there are still some cases where `readlines(sizehint)` is still useful. But nobody ever uses that, so… close enough. – abarnert Feb 18 '13 at 23:57
The new version will raise a `KeyError` if there's anything before the first section. Either add `None: None` to your `dict` (as in my answer), or change the `[key]` to a `.get(key)`. – abarnert Feb 18 '13 at 23:58

score 0 · Answer 4 · answered Feb 18 '13 at 23:47

My deepest sympathies!

Here's some code (please excuse minor syntax errors)

  def foundSectionHeader(l, secHdrs):
    # Return the matching header name, or None if the line is not a header.
    for s in secHdrs:
      if s in l:
        return s
    return None

  def main():
    fileList = ['file1.txt', 'file2.txt', ...]
    sectionHeaders = ['SHOES', 'HATS', ...]
    sectionContents = dict()
    for section in sectionHeaders:
      sectionContents[section] = []
    for file in fileList:
      with open(file) as fp:
        lines = fp.readlines()
      idx = 0
      while idx < len(lines):
        sec = foundSectionHeader(lines[idx], sectionHeaders)
        idx += 1
        if sec:
          # Collect lines until the next header or the end of the file.
          while idx < len(lines) and not foundSectionHeader(lines[idx], sectionHeaders):
            sectionContents[sec].append(lines[idx])
            idx += 1

This assumes that you don't have content lines which look like "SHOES"/"HATS" etc.
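One way to narrow that assumption is to treat a line as a heading only when the heading word starts the line and is followed by whitespace, a `#` comment, or the end of the line. A sketch with a hypothetical `HEADER_RE` pattern and `header_of` helper, using the heading names from the question's fake example:

```python
import re

# Hypothetical heading names; the real list would come from the format's known headings.
SECTION_HEADERS = ['SHOES', 'HATS', 'SUNGLASSES']

# Heading word at the start of the line, then whitespace, '#', or end of line.
HEADER_RE = re.compile(
    r'^(%s)(?=\s|#|$)' % '|'.join(re.escape(h) for h in SECTION_HEADERS),
    re.IGNORECASE,
)

def header_of(line):
    """Return the upper-cased heading that starts this line, or None."""
    match = HEADER_RE.match(line)
    return match.group(1).upper() if match else None
```

With this, a content line such as `red SHOES 2` or `SHOESTRING 3` is not mistaken for a heading, though a content line that genuinely begins with a bare heading word still would be.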

From a quick glance, this is exactly the same thing as the answer I already posted, but much more verbose and much less pythonic, and less robust, and not taking the filenames on argv the way the OP wanted… Is there something the other answers are missing? — abarnert, Feb 18 '13 at 23:53
No, I think I submitted right after you and failed to see your response. — Rahul Banerjee, Feb 19 '13 at 00:14
