
I have a script that accumulates (counts) bytes contained in two files. Bytes are C-like unsigned char integer values between 0 and 255.

The goal of this accumulator script is to count the joint frequencies of bytes across those two files, possibly extending this later to multiple files/dimensions.

The two files are the same size, but they are very large, on the order of 6 TB or so.

I am using numpy.uint64 values, as I am getting overflow problems using Python's int type.

I have a 1D accumulator array that is 256**2 elements long (one bucket per possible pair of byte values), in order to store joint counts.

I convert each (row, column) pair of byte values into a flat array offset (for example, the pair (3, 7) maps to offset 3 * 256 + 7 = 775), so as to increment the joint frequency at the right index. I walk through both files in chunks of n_bytes, unpack each chunk, and increment the frequency counters.

Here's a rough sketch of the code:

import numpy
import ctypes
import struct

# 2**8 == 256 possible values per unsigned byte
buckets_per_signal_type = 2**(ctypes.c_ubyte(1).value * 8)
# one bucket per (first_byte, second_byte) pair
total_buckets = buckets_per_signal_type**2
buckets = numpy.zeros((total_buckets,), dtype=numpy.uint64)

# open file handles to two files (omitted for brevity...)

# buffer size that is known ahead of time to divide the
# original file size evenly
# (for example, here, reading in 2.4e6 bytes per loop iteration)
n_bytes = 2400000

total_bytes = 0L

# format used to unpack bytes
struct_format = "=%dB" % (n_bytes)

while True:    
    # read in n_bytes chunk from each file
    first_file_bytes = first_file_handle.read(n_bytes)
    second_file_bytes = second_file_handle.read(n_bytes)

    # break if both file handles have nothing left to read
    if len(first_file_bytes) == 0 and len(second_file_bytes) == 0:
        break

    # unpack actual bytes
    first_bytes_unpacked = struct.unpack(struct_format, first_file_bytes)
    second_bytes_unpacked = struct.unpack(struct_format, second_file_bytes)

    for index in range(0, n_bytes):
        first_byte = first_bytes_unpacked[index]
        second_byte = second_bytes_unpacked[index]
        offset = first_byte * buckets_per_signal_type + second_byte
        buckets[offset] += 1

    total_bytes += n_bytes
    # repeat until both file handles are at EOF

# print out joint frequency (omitted)

Compared with the version where I used int, this is incredibly slow, roughly an order of magnitude slower. The original job completed (incorrectly, due to overflow) in about 8 hours, while this numpy-based version had to be killed early because it appeared to be on track to take about 12-14 days to finish.

Either numpy is incredibly slow at this basic task, or I'm not writing the accumulator in an idiomatic numpy way. I suspect the latter, which is why I'm asking SO for help.

I read about numpy.add.at, but the unpacked byte arrays I would add to the buckets array do not have offset values that translate naturally to the "shape" of the buckets array.

Is there a way to store and increment an array of (long) integers that does not overflow and is reasonably performant?

I could rewrite this in C, I guess, but am hoping there is something in numpy I am overlooking that will solve this quickly. Thanks for your advice.

Update

I had older versions of numpy and scipy that did not support numpy.add.at. So that was another issue to look into.

I'll try the following and see how that goes:

first_byte_arr = numpy.array(first_bytes_unpacked)
second_byte_arr = numpy.array(second_bytes_unpacked)
offsets = first_byte_arr * buckets_per_signal_type + second_byte_arr
numpy.add.at(buckets, offsets, 1)

Hopefully it runs a little faster!
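
For reference, here is a variant of the loop body I may also try, which skips struct.unpack entirely by viewing each chunk with numpy.frombuffer. This is just a rough sketch under the same assumptions about the file handles and chunk size as above:

while True:
    # read in n_bytes chunk from each file
    first_file_bytes = first_file_handle.read(n_bytes)
    second_file_bytes = second_file_handle.read(n_bytes)

    # break if both file handles have nothing left to read
    if len(first_file_bytes) == 0 and len(second_file_bytes) == 0:
        break

    # view the raw chunks as uint8 arrays, with no struct.unpack step
    first_byte_arr = numpy.frombuffer(first_file_bytes, dtype=numpy.uint8)
    second_byte_arr = numpy.frombuffer(second_file_bytes, dtype=numpy.uint8)

    # vectorized offset calculation (cast up from uint8 to avoid overflow)
    offsets = first_byte_arr.astype(numpy.intp) * buckets_per_signal_type + second_byte_arr

    # unbuffered accumulation handles duplicate offsets correctly
    numpy.add.at(buckets, offsets, 1)

    total_bytes += n_bytes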

Update II

Using np.add.at and np.array, this job will take roughly 12 days to complete. I'm going to give up on numpy for now and go back to reading raw bytes with C, where the runtimes are a bit more reasonable. Thanks all for your advice!

Alex Reynolds
  • "I read about `numpy.add.at`, but the unpacked byte arrays I would add to the `buckets` array do not have offset values that translate naturally to the "shape" of the `buckets` array." - could you force that using, say, `np.digitize`? Unfortunately I can't follow the things you are doing with `struct` well enough to give a good answer about the `numpy` – Daniel F Sep 04 '17 at 06:09
  • `struct` is used here to quickly generate an array of integers between 0 and 255 from each byte. I have one byte from the first file that maps to byte or integer A and one byte from the second file that maps to byte/integer B. I count how many times I see A and B together. I don't seem to get enough from the `numpy.digitize` documentation to see how it applies here, but I'll review it some more and see if it helps. Thanks for the pointer. – Alex Reynolds Sep 04 '17 at 06:16
  • So the only difference between this and the `int` version is the `dtype` of `buckets`? `add.at` is only useful if you are indexing duplicate elements in one call. In your code `offset` is just a scalar, right? – hpaulj Sep 04 '17 at 06:19
  • The original array is a zeroed list of `int` defined like so: `buckets = [0] * total_buckets`. I was aiming for something Python-y before going to a third-party library. – Alex Reynolds Sep 04 '17 at 06:20
  • Why not use a 2d `(256, 256)` `buckets` and do `buckets[first, second] +=1`? Or even something like `np.add.at(buckets, (first, second), 1)` – Daniel F Sep 04 '17 at 06:22
  • in any case, a [mcve] that removes dependencies other than `numpy` would be helpful to get a good answer, as it's probably a simple vectorization problem. And we've got some vectorization wizards around here that can help if they can understand the problem. – Daniel F Sep 04 '17 at 06:27
  • I'm afraid a minimal, complete and verifiable example may be beyond the scope of this question. It's just too much code to copy and paste. I was hoping the pseudocode would be clear to people familiar with those libraries. Thanks for the help, in any case. I do appreciate it. – Alex Reynolds Sep 04 '17 at 06:29

2 Answers


Without trying to follow all the file read and struct code, it looks like you are adding 1 to an assortment of slots in the buckets array. That part shouldn't be taking days.

But to get an idea of how the dtype of buckets affects that step, I'll test adding 1 to a random assortment of indices.

In [57]: idx = np.random.randint(0,255**2,10000)
In [58]: %%timeit buckets = np.zeros(255**2, dtype=np.int64)
    ...: for i in idx:
    ...:    buckets[i] += 1
    ...: 
9.38 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [59]: %%timeit buckets = np.zeros(255**2, dtype=np.uint64)
    ...: for i in idx:
    ...:    buckets[i] += 1
    ...: 
71.7 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

uint64 is about 8x slower.

If there weren't duplicates, we could just do buckets[idx] += 1. But allowing for duplicates we have to use add.at:

In [60]: %%timeit buckets = np.zeros(255**2, dtype=np.int64)
    ...: np.add.at(buckets, idx, 1)
    ...: 
1.6 ms ± 348 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [61]: %%timeit buckets = np.zeros(255**2, dtype=np.uint64)
    ...: np.add.at(buckets, idx, 1)
    ...: 
1.62 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Interesting that dtype uint64 does not affect the timing in this case.

You mention in comments that you tried a list accumulator. I assume like this:

In [62]: %%timeit buckets = [0]*(255**2)
    ...: for i in idx:
    ...:    buckets[i] += 1
    ...: 
3.59 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's faster than the iterative version using the array. In general, element-by-element iteration over numpy arrays is slower than over lists; it's the 'whole-array' operations, such as add.at, that are fast.
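
Another whole-array option worth trying, though not timed here, is np.bincount, which counts the occurrences of every offset in one call; the counts can then be added into the uint64 accumulator, and the totals should match what add.at produces:

buckets = np.zeros(255**2, dtype=np.uint64)
counts = np.bincount(idx, minlength=255**2)   # int64 occurrence counts for each offset
buckets += counts.astype(np.uint64)           # accumulate into uint64 without overflow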


To verify that add.at is the correct substitute for iteration, compare

In [63]: buckets0 = np.zeros(255**2, dtype=np.int64)
In [64]: for i in idx: buckets0[i] += 1

In [66]: buckets01 = np.zeros(255**2, dtype=np.int64)
In [67]: np.add.at(buckets01, idx, 1)
In [68]: np.allclose(buckets0, buckets01)
Out[68]: True

In [69]: buckets02 = np.zeros(255**2, dtype=np.int64)
In [70]: buckets02[idx] += 1
In [71]: np.allclose(buckets0, buckets02)
Out[71]: False

(buckets02[idx] += 1 applies only one increment per unique index, since duplicate indices are not accumulated, which is why it doesn't match.)

In [75]: bucketslist = [0]*(255**2)
In [76]: for i in idx: bucketslist[i] += 1
In [77]: np.allclose(buckets0, bucketslist)
Out[77]: True
hpaulj
  1. numpy has its own file I/O method, np.fromfile, which you'd probably be better off using if you want the output in a numpy array.

  2. Probably better to use numpy's array structure to make your buckets a 2D array:

    buckets_per_signal_type = 2**(ctypes.c_ubyte(1).value * 8)
    buckets = numpy.zeros((buckets_per_signal_type, buckets_per_signal_type), dtype=numpy.uint64)
    

    And then just use np.add.at to increment the bins:

    # define record_dtype to match your data; for raw bytes this could
    # simply be np.uint8 (see the note after this block)
    while True:
        data_1 = np.fromfile(first_file_handle, dtype=record_dtype, count=nbytes)
        data_2 = np.fromfile(second_file_handle, dtype=record_dtype, count=nbytes)
        s = np.minimum(data_1.size, data_2.size)
        if s == 0:
            break
        np.add.at(buckets, (data_1[:s], data_2[:s]), 1)
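
    For plain byte streams like the ones in the question, record_dtype would presumably just be np.uint8 (an assumption; the answer leaves the exact dtype to the reader), in which case count= is effectively a byte count. With the 2D buckets, the question's flat offset scheme corresponds to buckets.ravel():

    record_dtype = np.uint8   # assumption: one record per raw unsigned byte
    # buckets[first_byte, second_byte] holds the same count as
    # buckets.ravel()[first_byte * 256 + second_byte]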
    
Daniel F