
I have a file with millions of lines, each of which is a list of integers (these sublists are in the range of tens to hundreds of items). What I want is to read through the file contents once and create 3 numpy arrays -- one with the average of each sublist, one with the length of each sublist, and one which is a flattened list of all the values in all the sublists.

If I just wanted one of these things, I'd do something like:

counts = np.fromiter((len(json.loads(line.rstrip())) for line in mystream), int)

but if I write 3 of those, my code would iterate through my millions of sublists 3 times, and I obviously only want to iterate through them once. So I want to do something like this:

averages = []
counts = []
allvals = []
for line in mystream:
    sublist = json.loads(line.rstrip())
    averages.append(np.average(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)

I believe that creating regular Python lists as above and then doing

np_averages = np.array(averages)

is very inefficient (basically creating the list twice). What is the right/efficient way to iteratively build a numpy array when it's not practical to use fromiter? Or should I write a function that returns the 3 values for each line and use something like a list comprehension over that multiple-return function, with fromiter instead of a traditional list comprehension?

Or would it be efficient to create a 2D array of [[count1, average1, sublist1], [count2, average2, sublist2], ...] and then do additional operations to slice off (and in the 3rd case also flatten) the columns as their own 1D arrays?

PurpleVermont

1 Answer


First of all, the standard json module is not the most optimized library for this. You can use the pysimdjson package, based on the optimized simdjson library, to speed up the parsing. For small integer lists, it is about twice as fast on my machine.
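
For a rough sense of that difference, here is a minimal micro-benchmark sketch (timings vary with the machine and the line length; the sample line below is just an assumption modelled on the question):

import json
import timeit

import simdjson  # from the pysimdjson package

line = "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]"

# Both calls return the same Python list; only the parsing speed differs.
t_json = timeit.timeit(lambda: json.loads(line), number=100_000)
t_simd = timeit.timeit(lambda: simdjson.loads(line), number=100_000)
print(f"json.loads:     {t_json:.3f} s")
print(f"simdjson.loads: {t_simd:.3f} s")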

Moreover, Numpy functions are not great for relatively small arrays as they introduce a pretty big overhead. For example, np.average takes about 8-10 us on my machine to compute the average of a 20-item list. Meanwhile, sum(sublist)/len(sublist) only takes 0.25-0.30 us.
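
As a sketch of the kind of measurement behind those numbers (exact figures depend on the machine and the NumPy version):

import timeit

import numpy as np

sublist = list(range(20))

# np.average converts the list to an array and goes through NumPy dispatch,
# a fixed overhead that dominates for such a small input.
t_np = timeit.timeit(lambda: np.average(sublist), number=10_000)
# Plain Python arithmetic avoids that fixed cost on small lists.
t_py = timeit.timeit(lambda: sum(sublist) / len(sublist), number=10_000)
print(f"np.average:      {t_np / 10_000 * 1e6:.2f} us per call")
print(f"sum()/len():     {t_py / 10_000 * 1e6:.2f} us per call")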

Finally, np.array needs to iterate over the list twice to convert it into an array because it does not know the type of all the items in advance. You can specify the type to make the conversion faster: np.array(averages, np.float64).

Here is a significantly faster implementation:

import numpy as np
import simdjson  # from the pysimdjson package

averages = []
counts = []
allvals = []

for line in mystream:
    # simdjson.loads is a drop-in replacement for json.loads
    sublist = simdjson.loads(line.rstrip())
    averages.append(sum(sublist) / len(sublist))  # much cheaper than np.average on small lists
    counts.append(len(sublist))
    allvals.extend(sublist)

np_averages = np.array(averages, np.float64)

One issue with this implementation is that allvals will contain all the values in the form of a big list of Python objects. CPython objects are quite big in memory compared to native Numpy integers (especially compared to 32-bit, i.e. 4-byte, integers): each object usually takes 32 bytes and the reference in the list takes another 8 bytes, resulting in about 40 bytes per item, that is to say 10 times more than a Numpy 32-bit-integer array. Thus, it may be better to use a native implementation, possibly based on Cython.
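
Short of writing a Cython extension, one possible middle ground, sketched below under the assumption that all values fit in a native C int (32-bit on common platforms), is to accumulate allvals in a typed buffer from the standard-library array module so the flattened values are stored as native integers rather than Python objects:

import array

import numpy as np
import simdjson

averages = []
counts = []
# 'i' stores native C ints (typically 4 bytes each) instead of full Python
# int objects, so the flattened values use far less memory while streaming.
allvals = array.array('i')

for line in mystream:
    sublist = simdjson.loads(line.rstrip())
    averages.append(sum(sublist) / len(sublist))
    counts.append(len(sublist))
    allvals.extend(sublist)

np_averages = np.array(averages, np.float64)
np_counts = np.array(counts, np.int64)
# np.frombuffer reuses the memory of the array.array, so the flattened
# values are not copied a second time.
np_allvals = np.frombuffer(allvals, dtype=np.int32)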

Jérôme Richard