
I'm trying to create a diff of my sorted numpy array so that, if I record the value of the first row and the diffs, I can recreate the original table but store less data.

So here's an example of the table:

my_array = numpy.array([(0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0), 
                        (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1),
                        (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  2), 
                        (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34),
                        (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35), 
                        (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)
                       ],'uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8')

And after running numpy.diff(my_array) I would have expected something like this:

[(0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1), 
 (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1),
 (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 32),
 (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1),
 (0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  0,   0,  0,  1)
]

Note: The data above comes from the first and last three rows of the 'real' data, which is much larger. With the full dataset, most of the rows after a diff would be 0,0,0,0,0,0,0,0,0,0,0,0,0,1, which a) can be stored in a much smaller struct, and b) will compress very well on disk, since most rows contain nearly identical data.

I should probably point out that the reason I have a whole bunch of uint8s in the first place is that I needed to store an array of extremely large numbers in the smallest amount of memory possible. The largest number was 185439173519100986733232011757860, which is too big for uint64. In fact, it needs at least 108 bits, which rounds up to 14 bytes. So to fit these large numbers into numpy, I use the following two functions:

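def large_number_to_numpy(number, columns):
    # Split the number into `columns` bytes, most significant byte first.
    return tuple((number >> (8*x)) & 255 for x in range(columns-1, -1, -1))

def numpy_to_large_number(numbers):
    # int() keeps the shift in Python integers even if y is a numpy uint8.
    return sum([int(y) << (8*x) for x, y in enumerate(numbers[::-1])])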

Which is used like this:

>>> large_number_to_numpy(185439173519100986733232011757860L,14)
(9L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L)
>>> numpy_to_large_number((9L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L, 146L, 73L, 36L))
185439173519100986733232011757860L
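The session above is Python 2 (note the `L` suffixes). As a rough sketch, assuming Python 3, the same big-endian byte split can be done with the built-in `int.to_bytes` and `int.from_bytes`. The function names `large_number_to_bytes` and `bytes_to_large_number` are made up for this illustration and are not part of the original code:

```python
import numpy as np

def large_number_to_bytes(number, columns):
    # Hypothetical helper: big-endian split, most significant byte first,
    # matching the shift-based version above.
    return tuple(number.to_bytes(columns, "big"))

def bytes_to_large_number(numbers):
    # Hypothetical helper: accepts a tuple of ints or a uint8 numpy row.
    raw = np.asarray(numbers, dtype=np.uint8).tobytes()
    return int.from_bytes(raw, "big")

n = 185439173519100986733232011757860
row = large_number_to_bytes(n, 14)
round_trip = bytes_to_large_number(row)
```

`to_bytes` raises `OverflowError` if the number does not fit in `columns` bytes, whereas the shift-and-mask version silently drops the high bits.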

With the array created like this:

my_array = numpy.zeros(TOTAL_ROWS,','.join(14*['uint8']))

And then populated with:

my_array[x] = large_number_to_numpy(large_number,14)

But instead of the diff I expected, I get this:

>>> my_array
array([(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
       (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
       (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2),
       (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34),
       (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35),
       (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36)],
      dtype=[('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')])
>>> numpy.diff(my_array)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1567, in diff
    return a[slice1]-a[slice2]
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')]) dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')]) dtype([('f0', 'u1'), ('f1', 'u1'), ('f2', 'u1'), ('f3', 'u1'), ('f4', 'u1'), ('f5', 'u1'), ('f6', 'u1'), ('f7', 'u1'), ('f8', 'u1'), ('f9', 'u1'), ('f10', 'u1'), ('f11', 'u1'), ('f12', 'u1'), ('f13', 'u1')])
J.J
  • I think the problem is that you have composite data (tuple-valued arrays), for which subtraction is not defined. Why don't you use a simple `uint8`-valued 2d array? You would only have to change all those `'uint8'` strings in the definition to a single `'uint8'`. – Andras Deak -- Слава Україні Jun 29 '16 at 14:34
  • The datatypes of the array rows are different. I guess python does not know how to subtract such datatypes. Maybe `my_array = int(my_array)` will work. – Aguy Jun 29 '16 at 14:35

2 Answers


The problem is that you have a structured array instead of a regular 2-dimensional array, so numpy does not know how to subtract one record from another.

Convert your structured array to a regular array (from this SO question):

my_array = my_array.view(numpy.uint8).reshape((my_array.shape[0], -1))

and then do numpy.diff(my_array, axis=0).

Or, if you can, avoid creating a structured array by defining my_array as

numpy.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
             [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 34],
             [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 35],
             [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36]],
            dtype=numpy.uint8)
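To make the first suggestion concrete, here is a minimal sketch that rebuilds the structured array from the question, views it as plain `uint8`, and diffs along the rows. The variable names `structured`, `plain` and `diffs` are chosen just for this example:

```python
import numpy as np

tail = (9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73)
rows = [(0,) * 13 + (k,) for k in (0, 1, 2)]
rows += [tail + (k,) for k in (34, 35, 36)]

# Structured array with 14 one-byte fields, as in the question.
structured = np.array(rows, dtype=",".join(["uint8"] * 14))

# Reinterpret the same bytes as a flat uint8 array, then fold it
# back into one row per record.
plain = structured.view(np.uint8).reshape((structured.shape[0], -1))

# Row-to-row differences; axis=0 diffs down the rows instead of
# across the columns (the default is the last axis).
diffs = np.diff(plain, axis=0)
print(diffs)
```

The view works here because the fields are all one byte wide and packed without padding, so each record is exactly 14 consecutive `uint8` values.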
Alicia Garcia-Raboso
  • Note that just passing `dtype='uint8'` instead of `dtype='uint8,uint8,...'` should give the same result, even with the tuple input. – Andras Deak -- Слава Україні Jun 29 '16 at 14:48
  • Thank you both Alberto and Andras for your help - I see now it's because I have a weird datatype going on. Unfortunately, I don't really know what to do to fix that, since my array starts off empty and gets filled (like a structured array should). I've added more detail in the original post. Hope that helps. In the meantime I'll see if everything "just works" if I set the dtype to a single uint8. All the best! – J.J Jun 29 '16 at 15:23
  • How about if you change `tuple` to `numpy.array` in your `large_number_to_numpy` function? – Alicia Garcia-Raboso Jun 29 '16 at 15:27
  • Good idea Alberto! But unfortunately it didn't work :( Here's a MCVE: http://paste.ofcode.org/yfsZg5b2xFpAvQtqQ3ZG3y – J.J Jun 29 '16 at 15:35
  • Ah, but that combined with Andras' previous suggestion fixed it!! I am going to accept your answer for the points, but I will submit a new answer with the final solution. – J.J Jun 29 '16 at 15:43

Thanks to Alberto and Andras, here is what I needed to do:

Change my array from:

my_array = numpy.zeros(6,'uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8,uint8')

To:

my_array = numpy.zeros((6,14),'uint8')

I have absolutely no idea why one is different to the other, but I guess that's just how numpy rolls.

I can then populate with:

my_array[0] = large_number_to_numpy(0,14)
my_array[1] = large_number_to_numpy(1,14)
my_array[2] = large_number_to_numpy(2,14)
my_array[3] = large_number_to_numpy(185439173519100986733232011757858,14)
my_array[4] = large_number_to_numpy(185439173519100986733232011757859,14)
my_array[5] = large_number_to_numpy(185439173519100986733232011757860,14)

Generating:

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   1],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   2],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73,  34],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73,  35],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36]], dtype=uint8)

Diffing with numpy.diff(my_array,1,0) (that is, n=1 along axis=0) then gives:

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1],
       [  9,  36, 146,  73,  36, 146,  73,  36, 146,  73,  36, 146,  73, 32],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  1]], dtype=uint8)

Which is perfect :)
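For completeness, here is a sketch of the round trip the question asked for: keep the first row plus the diffs, then rebuild the table with a cumulative sum. It assumes a 2-D `uint8` table as above; the names `encode` and `decode` are invented for this example. Because `uint8` arithmetic wraps modulo 256, each column is restored exactly even when a byte-wise difference wraps around:

```python
import numpy as np

def encode(table):
    # Hypothetical helper: keep the first row and the row-to-row diffs.
    return table[0].copy(), np.diff(table, axis=0)

def decode(first_row, diffs):
    # Hypothetical helper: a running sum in uint8 (wrapping modulo 256)
    # undoes the byte-wise diffs column by column.
    stacked = np.vstack([first_row, diffs])
    return np.cumsum(stacked, axis=0, dtype=np.uint8)

tail = [9, 36, 146, 73, 36, 146, 73, 36, 146, 73, 36, 146, 73]
table = np.array(
    [[0] * 13 + [k] for k in (0, 1, 2)] + [tail + [k] for k in (34, 35, 36)],
    dtype=np.uint8,
)

first_row, diffs = encode(table)
restored = decode(first_row, diffs)
print(np.array_equal(restored, table))
```

Passing `dtype=np.uint8` to `numpy.cumsum` matters: without it the sum is accumulated in a wider integer type and wrapped differences would not cancel out.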

J.J
  • You originally created a `structured array`, with a compound `dtype` (14 named fields). The other array is a 2d array with a simple `dtype`, `uint8`. The difference between named fields and indexed columns is important. – hpaulj Jun 29 '16 at 17:12
  • It seems from your comment that there's a sort of fundamental difference in usage. Named columns are for different, named, series of data; like yearly GDP for a number of countries, whilst the 2d array is somewhat mathematically purer, like a matrix of pixels on a screen, or any other n-dimensional dataset. – J.J Jun 29 '16 at 17:53
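The distinction discussed in these comments can be seen directly by comparing the two arrays. This is only an illustrative sketch; the variable names are chosen for the example:

```python
import numpy as np

# 1-D array of 6 records, each record holding 14 named one-byte fields.
structured = np.zeros(6, dtype=",".join(["uint8"] * 14))

# 2-D array of 6 rows by 14 columns with a single simple dtype.
plain = np.zeros((6, 14), dtype=np.uint8)

print(structured.shape, structured.dtype.names)  # (6,) and names f0 .. f13
print(plain.shape, plain.dtype.names)            # (6, 14) and None
print(structured.nbytes, plain.nbytes)           # same memory footprint

# Fields are selected by name, columns by index.
last_field = structured["f13"]
last_column = plain[:, 13]

# Arithmetic is defined for the simple dtype but not for whole records.
plain_diff = plain[1:] - plain[:-1]
try:
    structured[1:] - structured[:-1]
    subtraction_failed = False
except TypeError:
    subtraction_failed = True
print(subtraction_failed)
```

Both layouts occupy the same number of bytes, so switching to the 2-D `uint8` array costs nothing in memory while making `numpy.diff` and other arithmetic available.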