Can someone please explain the groupby operation and the lambda function being used on this SO post?
key=lambda k, line=count(): next(line) // chunk
import tempfile
from itertools import groupby, count
temp_dir = tempfile.mkdtemp()
def tempfile_split(filename, temp_dir, chunk=4000000):
with open(filename, 'r') as datafile:
# The itertools.groupby() function takes a sequence and a key function,
# and returns an iterator that generates pairs.
# Each pair contains the result of key_function(each item) and
# another iterator containing all the items that shared that key result.
groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
for k, group in groups:
print(key, list(group))
output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
for line in group:
with open(output_name, 'a') as outfile:
outfile.write(line)
Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.
Martijn explained it really well, however I have a follow up question. Why is line=count()
passed as an argument to the lambda function every time? I tried assigning the variable line
to count()
just once, outside the function.
line = count()
groups = groupby(datafile, key=lambda k, line: next(line) // chunk)
and it resulted in TypeError: <lambda>() missing 1 required positional argument: 'line'
Also, calling next
on count()
directly within the lambda expression, resulted in all the lines in the input file getting bunched together i.e a single key was generated by the groupby
function.
groups = groupby(datafile, key=lambda k: next(count()) // chunk)
I'm learning Python on my own, so any help or pointers to reference materials /PyCon talks are much appreciated. Anything really!