8

I have the following 3d numpy array np.random.rand(6602, 3176, 2). I would like to convert it to a 2d array (numpy or pandas.DataFrame), where each value inside is a tuple, such that the shape is (6602, 3176).

This question helped me see how to decrease the dimensions, but I still struggle with the tuple bit.

Newskooler
  • I think I have a better question: why would you want that? Strictly speaking, what you are asking would require you to use a NumPy array of type `object`, but it is not a good use-case for the problem you seem to be dealing with. Perhaps you are running into the [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)? For most practical purposes, sticking to the 3D array and figuring out a smart way of using the `axis` parameter of NumPy functions is probably the way to go. – norok2 Sep 16 '19 at 10:54
  • @norok2 you may very well be right. Maybe I should rethink a more elegant solution to it. Thanks, the link is interesting to read. – Newskooler Sep 16 '19 at 10:57
  • "where each value inside is a tuple, such that the shape is (6602, 3176)." can you please rephrase that statement. – moe asal Sep 16 '19 at 10:59

4 Answers

9

Here is a one-liner which takes a few seconds on the full (6602, 3176, 2) problem

a = np.random.rand(6602, 3176, 2)

b = a.view([(f'f{i}',a.dtype) for i in range(a.shape[-1])])[...,0].astype('O')

The trick here is to viewcast to a compound dtype which spans exactly one row. When such a compound dtype is then cast to object, each compound element is converted to a tuple.
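To see what the view cast does step by step, here is a minimal sketch on a small array. The names `small`, `compound`, `structured`, and `as_tuples` are illustrative and not part of the original answer, and the sketch assumes the input is C-contiguous so that `view` can regroup the last axis:

```python
import numpy as np

small = np.arange(12, dtype=float).reshape(2, 3, 2)

# Compound dtype with one field per entry of the last axis
compound = [(f'f{i}', small.dtype) for i in range(small.shape[-1])]

# The view regroups each row of 2 floats into one structured element;
# the last axis shrinks to length 1, which [..., 0] then drops
structured = small.view(compound)[..., 0]
print(structured.shape)  # (2, 3)

# Casting to object converts every structured element to a tuple
as_tuples = structured.astype('O')
print(type(as_tuples[0, 0]), as_tuples[0, 0])
```

No data is copied until the final `astype('O')` call, which is where the Python tuples are actually created.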

UPDATE (hat tip @hpaulj) there is a library function that does precisely the view casting we do manually: numpy.lib.recfunctions.unstructured_to_structured

Using this we can write a more readable version of the above:

import numpy.lib.recfunctions as nlr

b = nlr.unstructured_to_structured(a).astype('O')
Paul Panzer
  • Very elegant! One question: why `f'f{i}'` instead of just `str(i)`? – norok2 Sep 16 '19 at 13:12
  • @norok2 force of habit I suppose. – Paul Panzer Sep 16 '19 at 13:15
  • `numpy.lib.recfunctions.unstructured_to_structured` is the new recommended tool for converting an array to a structured dtype. In this case it just eliminates the need for that `[...,0]` step. `unstructured_to_structured(a)` is enough. – hpaulj Sep 16 '19 at 17:16
2

If you really want to do this, you have to set the dtype of your array to object. E.g., if you have the mentioned array:

a = np.random.rand(6602, 3176, 2)

You could create a second empty array with shape (6602, 3176) and set dtype to object:

b = np.empty(a[:,:,0].shape, dtype=object)

and fill your array with tuples.
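One way to do the filling is sketched below, using a small stand-in array so it runs quickly. The loop via `np.ndindex` is my assumption about how the tuples might be written in; it is a plain Python loop and would be slow on the full (6602, 3176, 2) array:

```python
import numpy as np

a = np.random.rand(4, 5, 2)  # small stand-in for the full-size array

b = np.empty(a.shape[:-1], dtype=object)

# np.ndindex yields every (n, m) index pair of b
for idx in np.ndindex(b.shape):
    # .tolist() converts to Python floats before building the tuple
    b[idx] = tuple(a[idx].tolist())

print(b.shape)  # (4, 5)
```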

But in the end there is no big advantage! I'd just use slicing to get the pairs from your initial array a. To access the pair at index n (1st dimension) and m (2nd dimension), slice along the third dimension of your 3d array:

a[n,m,:]
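As a small illustration of that slicing (the indices `n` and `m` and the stand-in array size are arbitrary choices for this sketch), the slice can be unpacked directly or converted to a real tuple only at the point where one is needed:

```python
import numpy as np

a = np.random.rand(4, 5, 2)  # small stand-in for the full-size array
n, m = 1, 3                  # arbitrary example indices

pair = a[n, m, :]            # 1-D array of length 2
first, second = pair         # unpacks like a tuple would

# Convert on demand if an actual Python tuple is required
as_tuple = tuple(pair.tolist())
```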
AnsFourtyTwo
0

If you are happy with list instead of tuple, this could be achieved with the following trick:

  1. convert your array to list of lists using .tolist()
  2. change the size of one of the innermost lists (misalign it)
  3. convert the list of lists back to NumPy array
  4. fix the modification of point 2.

This is implemented in the following function last_dim_as_list():

import numpy as np


def last_dim_as_list(arr):
    if arr.ndim > 1:
        # : convert to list of lists
        arr_list = arr.tolist()
        # : misalign size of the first innermost list
        temp = arr_list
        for _ in range(arr.ndim - 1):
            temp = temp[0]
        temp.append(None)
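        # : convert to NumPy array
        # (the misalignment makes NumPy stop at the second-to-last level;
        # recent NumPy versions need an explicit `dtype=object` for this)
        result = np.array(arr_list, dtype=object)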
        # : revert the misalignment
        temp.pop()
    else:
        result = np.empty(1, dtype=object)
        result[0] = arr.tolist()
    return result

np.random.seed(0)
in_arr = np.random.randint(0, 9, (2, 3, 2))
out_arr = last_dim_as_list(in_arr)


print(in_arr)
# [[[5 0]
#   [3 3]
#   [7 3]]
#  [[5 2]
#   [4 7]
#   [6 8]]]
print(in_arr.shape)
# (2, 3, 2)
print(in_arr.dtype)
# int64

print(out_arr)
# [[list([5, 0]) list([3, 3]) list([7, 3])]
#  [list([5, 2]) list([4, 7]) list([6, 8])]]
print(out_arr.shape)
# (2, 3)
print(out_arr.dtype)
# object

However, I would NOT recommend taking this route unless you really know what you are doing. Most of the time you are better off by keeping everything as a NumPy array of higher dimensionality, and make good use of NumPy indexing.
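As a rough illustration of what keeping the higher-dimensional array can look like, the sketch below treats the last axis as the pair and works on it through indexing and the `axis` parameter. The specific operations (a sum and a Euclidean norm per pair) are only examples chosen for this sketch, not something the question asked for:

```python
import numpy as np

arr = np.random.rand(4, 5, 2)  # small stand-in for the (6602, 3176, 2) array

# Reduce over the last axis: one value per (row, column) pair
pair_sum = arr.sum(axis=-1)               # shape (4, 5)
pair_norm = np.linalg.norm(arr, axis=-1)  # shape (4, 5)

# Split the pairs into their components without copying
x, y = arr[..., 0], arr[..., 1]
```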


Note that this could also be done with explicit loops, but the proposed approach should be much faster for large enough inputs:

def last_dim_as_list_loop(arr):
    shape = arr.shape
    result = np.empty(arr.shape[:-1], dtype=object).ravel()
    for k in range(arr.shape[-1]):
        for i in range(result.size):
            if k == 0:
                result[i] = []
            result[i].append(arr[..., k].ravel()[i])
    return result.reshape(shape[:-1])


out_arr2 = last_dim_as_list_loop(in_arr)

print(out_arr2)
# [[list([5, 0]) list([3, 3]) list([7, 3])]
#  [list([5, 2]) list([4, 7]) list([6, 8])]]
print(out_arr2.shape)
# (2, 3)
print(out_arr2.dtype)
# object

But the timings for this last approach are not exactly spectacular:

%timeit last_dim_as_list(in_arr)
# 2.53 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit last_dim_as_list_loop(in_arr)
# 12.2 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The view-based approach proposed by @PaulPanzer is very elegant and more efficient than the trick proposed in last_dim_as_list() because it loops (internally) through the array only once as compared to twice:

def last_dim_as_tuple(arr):
    dtype = [(str(i), arr.dtype) for i in range(arr.shape[-1])]
    return arr.view(dtype)[..., 0].astype(object)

and therefore the timings on large enough inputs are more favorable:

in_arr = np.random.random((6602, 3176, 2))


%timeit last_dim_as_list(in_arr)
# 4.9 s ± 73.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit last_dim_as_tuple(in_arr)
# 3.07 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
norok2
0

A vectorized approach (it's a bit tricky):

mat = np.random.rand(6602, 3176, 2)

f = np.vectorize(lambda x:tuple(*x.items()), otypes=[np.ndarray])
mat2 = np.apply_along_axis(lambda x:dict([tuple(x)]), 2, mat)
mat2 = np.vstack(f(mat2))
mat2.shape
Out: (6602, 3176)

type(mat2[0,0])
Out: tuple
dtrckd
  • This seems to be quite inefficient though. Could you perhaps explain the idea behind the steps you perform? – norok2 Sep 16 '19 at 16:54
  • @norok2, yes it is indeed. It creates a dict proxy along the third axis in order to reduce it to a tuple. I did this because converting to tuple only by using `np.apply_along_axis(lambda x:tuple(x), 2, mat)` ignores the tuple type and returns a copy of the input array. – dtrckd Sep 16 '19 at 21:15
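The behaviour described in that last comment can be checked with a short sketch on a small stand-in array: `np.apply_along_axis` converts each returned tuple back into an array, so the result keeps the original shape and dtype instead of holding tuples.

```python
import numpy as np

mat = np.random.rand(4, 5, 2)  # small stand-in for the full-size array

# Each returned tuple is turned back into an array of length 2
out = np.apply_along_axis(tuple, 2, mat)

print(out.shape, out.dtype)
```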