I have a script that accumulates (counts) bytes contained in two files. Bytes are C-like unsigned char values, i.e. integers between 0 and 255.
The goal of this accumulator script is to count the joint frequencies of bytes at corresponding positions in those two files, possibly extending this later to multiple files/dimensions. The two files are the same size, and they are very large, on the order of 6 TB or so.
I am using numpy.uint64 values, as I am getting overflow problems using Python's int type. I have a 1D accumulator array that is 256**2 in length (one bucket per possible byte pair), in order to store joint counts.
I calculate a flat array offset from each (row, column) byte pair, so as to increment the joint frequency at the right index (see the small example below). I walk through both files in chunks of n_bytes bytes, unpack them, and increment the frequency counter.
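To make the mapping concrete, here is the row-major offset math with made-up byte values (a toy sketch, not taken from the actual script):

buckets_per_signal_type = 256  # distinct values a single unsigned byte can take

first_byte = 37
second_byte = 200

# row-major layout: the row index scales by the row width
offset = first_byte * buckets_per_signal_type + second_byte  # 37 * 256 + 200 == 9672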
Here's a rough sketch of the code:
import numpy
import ctypes
import struct

# number of distinct values a single unsigned byte can take (2**8 == 256)
buckets_per_signal_type = 2 ** (8 * ctypes.sizeof(ctypes.c_ubyte))
total_buckets = buckets_per_signal_type ** 2
buckets = numpy.zeros((total_buckets,), dtype=numpy.uint64)

# open file handles to two files (omitted for brevity...)

# buffer size that is known ahead of time to evenly divide
# the original files
# (for example, here, reading in 2.4e6 bytes per loop iteration)
n_bytes = 2400000

total_bytes = 0

# format used to unpack each chunk into a tuple of unsigned bytes
struct_format = "=%dB" % (n_bytes)

while True:
    # read an n_bytes chunk from each file
    first_file_bytes = first_file_handle.read(n_bytes)
    second_file_bytes = second_file_handle.read(n_bytes)

    # break if both file handles have nothing left to read
    if len(first_file_bytes) == 0 and len(second_file_bytes) == 0:
        break

    # unpack the raw bytes into tuples of ints in [0, 255]
    first_bytes_unpacked = struct.unpack(struct_format, first_file_bytes)
    second_bytes_unpacked = struct.unpack(struct_format, second_file_bytes)

    # increment the joint count for each aligned byte pair
    for index in range(0, n_bytes):
        first_byte = first_bytes_unpacked[index]
        second_byte = second_bytes_unpacked[index]
        offset = first_byte * buckets_per_signal_type + second_byte
        buckets[offset] += 1

    total_bytes += n_bytes

# repeat until both file handles are at EOF

# print out joint frequency (omitted)
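As an aside on the unpacking step: if I am reading the numpy docs right, struct.unpack could be skipped entirely by viewing each raw chunk as a uint8 array (an untested sketch; note the upcast, since uint8 arithmetic would wrap around at 256):

# view the raw chunk as unsigned bytes, no Python-level tuple needed
first_byte_arr = numpy.frombuffer(first_file_bytes, dtype=numpy.uint8)
second_byte_arr = numpy.frombuffer(second_file_bytes, dtype=numpy.uint8)

# upcast before the multiply: a uint8 product would overflow
offsets = first_byte_arr.astype(numpy.uint32) * buckets_per_signal_type \
          + second_byte_arr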
Compared with the version where I used int, this is incredibly slow: an order of magnitude slower. The original job completed (incorrectly, due to overflow) in about 8 hours, while this numpy-based version had to be quit early, as it appeared it would take about 12-14 days to finish.

Either numpy is incredibly slow at this basic task, or I'm not doing the accumulation with numpy in an idiomatic way. I suspect the latter, which is why I'm asking SO for help.
I read about numpy.add.at, but the unpacked byte arrays I would add to the buckets array do not have offset values that translate naturally to the "shape" of the buckets array; my best guess at a shape-aware version is sketched below.
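If I'm reading the docs right, one way around the shape mismatch would be to view the 1D buckets array as a 256x256 grid and let numpy.add.at do the fancy-indexed increments (again an untested sketch; first_byte_arr and second_byte_arr are the per-chunk byte arrays from the sketch above):

# reshape returns a view of the contiguous buckets array, so
# incrementing counts_2d updates buckets in place
counts_2d = buckets.reshape(buckets_per_signal_type, buckets_per_signal_type)

# numpy.add.at handles repeated (row, col) pairs correctly, unlike
# plain fancy-indexed +=, which would drop duplicate increments
numpy.add.at(counts_2d, (first_byte_arr, second_byte_arr), 1)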
Is there a way to store and increment an array of (long) integers that does not overflow and is reasonably performant?
I could rewrite this in C, I guess, but am hoping there is something in numpy I am overlooking that will solve this quickly. Thanks for your advice.
Update
I had older versions of numpy and scipy that did not support numpy.add.at, so that was another issue to look into. I'll try the following and see how it goes:
import numpy as np

# vectorize the per-chunk work: build all flat offsets at once,
# then let np.add.at apply the repeated increments correctly
first_byte_arr = np.array(first_bytes_unpacked)
second_byte_arr = np.array(second_bytes_unpacked)
offsets = first_byte_arr * buckets_per_signal_type + second_byte_arr
np.add.at(buckets, offsets, 1)
Hopefully it runs a little faster!
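If that is still too slow, one more thing worth timing (again, an untested sketch) is numpy.bincount, which histograms a whole chunk's offsets in a single call; the per-chunk result can then be added into the accumulator:

# count every flat offset in the chunk at once; minlength pads the
# per-chunk histogram out to the full size of the buckets table
chunk_counts = np.bincount(offsets, minlength=total_buckets)

# cast explicitly so the int64 histogram adds cleanly into the
# uint64 accumulator
buckets += chunk_counts.astype(np.uint64)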
Update II
Using np.add.at and np.array, this job will take roughly 12 days to complete. I'm going to give up on numpy for now and go back to reading raw bytes with C, where the runtimes are a bit more reasonable. Thanks all for your advice!