Inspired by @schmichael's valiant effort at a functional Python solution, here is my attempt at pushing things too far. I'm not claiming it's maintainable, efficient, exemplary, or scrutable, but it is functional:
    from itertools import imap, groupby, izip, chain
    from collections import deque
    from operator import itemgetter, methodcaller
    from functools import partial

    def shifty_csv_dicts(lines):
        last = lambda seq: deque(seq, maxlen=1).pop()
        parse_header = lambda header: header[1:-1].split(',')
        parse_row = lambda row: row.rstrip('\n').split(',')
        mkdict = lambda keys, vals: dict(izip(keys, vals))

        headers_then_rows = imap(itemgetter(1),
                                 groupby(lines, methodcaller('startswith', '#')))
        return chain.from_iterable(
            imap(partial(mkdict, parse_header(last(headers))),
                 imap(parse_row, next(headers_then_rows)))
            for headers in headers_then_rows)
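Before unpacking it, here's a minimal usage sketch (the sample input and output are my own illustration; this assumes Python 2, since imap and izip live in itertools there):

    lines = iter([
        '#name,color\n',
        'alice,blue\n',
        'bob,green\n',
        '#fruit,count\n',
        'apple,7\n',
    ])
    for d in shifty_csv_dicts(lines):
        print(d)
    # {'name': 'alice', 'color': 'blue'}   (dict key order may vary)
    # {'name': 'bob', 'color': 'green'}
    # {'fruit': 'apple', 'count': '7'}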
Okay, let's unpack that.
The basic insight is to (ab)use itertools.groupby to recognize changes from headers to data rows. We use argument evaluation semantics to control the order of operations.
First, we tell groupby to group lines by whether or not they start with '#':

    methodcaller('startswith', '#')

This creates a function that takes a line and calls line.startswith('#') (it is equivalent to the stylistically preferable but less efficient lambda line: line.startswith('#')).
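If methodcaller is unfamiliar, here's a quick illustration (my own example):

    from operator import methodcaller

    is_header = methodcaller('startswith', '#')
    is_header('#a,b,c\n')  # True
    is_header('1,2,3\n')   # False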
So groupby takes the incoming iterable of lines and alternates between returning an iterable of header lines (usually just one header) and an iterable of data rows. Each iteration actually yields a (group_val, group_iter) tuple, where in this case group_val is a bool indicating whether the group is headers. So we do the equivalent of (group_val, group_iter)[1] on each tuple to pick out the iterators: itemgetter(1) is just a function that runs "[1]" on whatever you give it (again, equivalent to but more efficient than lambda t: t[1]). So we use imap to run our itemgetter function on every tuple returned by groupby, picking out the header/data iterators:

    imap(itemgetter(1), groupby(lines, methodcaller('startswith', '#')))
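Concretely, on a small input that stage yields alternating header and row groups (my own illustration):

    from itertools import groupby
    from operator import methodcaller

    lines = ['#a,b\n', '1,2\n', '3,4\n', '#c,d\n', '5,6\n']
    for is_header, group in groupby(lines, methodcaller('startswith', '#')):
        print(is_header, list(group))
    # (True, ['#a,b\n'])
    # (False, ['1,2\n', '3,4\n'])
    # (True, ['#c,d\n'])
    # (False, ['5,6\n'])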
We evaluate that expression first and give it a name because we will use it twice later, first for the headers, then for the data. The outermost call:
    chain.from_iterable(... for headers in headers_then_rows)
steps through the iterators returned from groupby. We are being sly and calling the value headers because some other code inside the ... picks off the rows when we're not looking, advancing the groupby iterator in the process. This outer generator expression will only ever produce headers (remember, they alternate: headers, data, headers, data...). The trick is to make sure the headers get consumed before the rows, because they both share the same underlying iterator. chain.from_iterable just stitches the results of all the data row iterators together into One Iterator To Return Them All.
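In isolation, chain.from_iterable is simple (my own toy example). Note that it drains each inner iterable completely before asking for the next one, which (assuming the consumer iterates the result in order) is what guarantees each batch of rows is consumed before we advance to the next header:

    from itertools import chain

    list(chain.from_iterable([[1, 2], [3, 4], [5]]))
    # [1, 2, 3, 4, 5]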
So what are we stitching together? Well, we need to take the (last) header, zip it with each row of values, and make dicts out of that. This:

    last = lambda seq: deque(seq, maxlen=1).pop()

is a somewhat dirty but efficient hack to get the last item from an iterator, in this case our header line. We then parse the header by trimming the leading # and trailing newline, and split on , to get a list of column names:

    parse_header = lambda header: header[1:-1].split(',')
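Both helpers in action (my own illustration; the deque with maxlen=1 drains the iterator while keeping only the final item it saw):

    from collections import deque

    last = lambda seq: deque(seq, maxlen=1).pop()
    parse_header = lambda header: header[1:-1].split(',')

    last(iter(['#old,headers\n', '#a,b,c\n']))  # '#a,b,c\n'
    parse_header('#a,b,c\n')                    # ['a', 'b', 'c']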
But we only want to do this once for each rows iterator, because it exhausts our headers iterator (and we wouldn't want to copy it into some mutable state now, would we?). We also have to ensure that the headers iterator gets used before the rows. The solution is to make a partially applied function, evaluating and fixing the headers as the first parameter, and taking a row as the second parameter:

    partial(mkdict, parse_header(last(headers)))
The mkdict function uses the column names as keys and row data as values to make a dict:

    mkdict = lambda keys, vals: dict(izip(keys, vals))

This gives us a function that freezes the first parameter (keys) and lets you just pass the second parameter (vals): just what we need for creating a bunch of dicts with the same keys and different values.
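Here is the partial application in miniature (my own example with made-up keys; note that Python evaluates parse_header(last(headers)) before partial is even called, and that argument-evaluation order is exactly what forces the headers iterator to be consumed ahead of the rows):

    from functools import partial
    from itertools import izip

    mkdict = lambda keys, vals: dict(izip(keys, vals))
    row_to_dict = partial(mkdict, ['a', 'b'])  # keys frozen in place

    row_to_dict(['1', '2'])  # {'a': '1', 'b': '2'}
    row_to_dict(['3', '4'])  # {'a': '3', 'b': '4'}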
To use it, we parse each row like you'd expect:

    parse_row = lambda row: row.rstrip('\n').split(',')

recalling that next(headers_then_rows) will return the data rows iterator from groupby (since we already used the headers iterator):

    imap(parse_row, next(headers_then_rows))
Finally, we map our partially applied dict-maker function to the parsed rows:

    imap(partial(...), imap(parse_row, next(headers_then_rows)))
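Putting those last two stages together for a single header/rows pair looks like this (my own illustration, redefining the helpers so it runs standalone):

    from functools import partial
    from itertools import imap, izip

    mkdict = lambda keys, vals: dict(izip(keys, vals))
    parse_row = lambda row: row.rstrip('\n').split(',')

    rows = iter(['1,2\n', '3,4\n'])
    list(imap(partial(mkdict, ['a', 'b']), imap(parse_row, rows)))
    # [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]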
And these are all stitched together by chain.from_iterable to make one, big, happy, functional stream of shifty CSV dicts.
For the record, this could probably be simplified, and I would still do things @schmichael's way. But I learned things figuring this out, and I will try applying these ideas to a Scala solution.