I have a file with millions of lines, each of which is a list of integers (the sublists range from tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays: one with the average of each sublist, one with the length of each sublist, and one holding all the values from all the sublists, flattened.
If I just wanted one of these things, I'd do something like:
counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)
but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:
import json
import numpy as np

averages = []
counts = []
allvals = []
for line in mystream:  # single pass over the file
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)
I believe that building regular Python lists as above and then doing
np_averages = np.array(averages)
is very inefficient (it basically creates each list twice). What is the right/efficient way to build a numpy array iteratively when it's not practical to use fromiter? Or should I write a function that returns all 3 values for each line and consume it with fromiter instead of a traditional list comprehension?
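To show what I mean by that last idea (summarize is just a name I made up for the helper):

import json
import numpy as np

def summarize(line):
    # parse once, return all 3 results for the line
    sublist = json.loads(line.rstrip())
    return np.average(sublist), len(sublist), sublist

rows = [summarize(line) for line in mystream]  # single pass over the file
averages = np.fromiter((avg for avg, _, _ in rows), dtype=float, count=len(rows))
counts = np.fromiter((n for _, n, _ in rows), dtype=int, count=len(rows))
allvals = np.fromiter((v for _, _, vals in rows for v in vals), dtype=int)

But that still materializes the whole list of tuples first, so I'm not sure it buys me anything over the plain loop.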
Or would it be efficient to create a 2D array of
[[count1, average1, sublist1], [count2, average2, sublist2], ...]
and then do additional operations to slice off the columns as their own 1D arrays (and, for the third column, also flatten it)?
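To make that concrete, this is the kind of thing I'm picturing (since each row mixes scalars with a variable-length list, I assume the array would have to use dtype=object, which may already defeat the purpose):

import json
import numpy as np

rows = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    rows.append((len(sublist), np.average(sublist), sublist))

arr = np.array(rows, dtype=object)  # shape (n, 3); object dtype because of the ragged sublists
counts = arr[:, 0].astype(int)      # column 0 as a 1D int array
averages = arr[:, 1].astype(float)  # column 1 as a 1D float array
allvals = np.concatenate([np.asarray(v) for v in arr[:, 2]])  # column 2, flattened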