
I have a list of lists of tuples: each sublist has the same length, and each tuple has length 2. Each sublist represents a sentence, and the tuples are the bigrams of that sentence.

When using np.asarray to turn this into an array, NumPy seems to interpret the tuples as a request for a third dimension.

Full working code here:

import numpy as np 
from nltk import bigrams  

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

bi_grams = []
for sent in arr:
    bi_grams.append(list(bigrams(sent)))
bi_grams = np.asarray(bi_grams)
print(bi_grams)

So before turning bi_grams into an array it looks like this: [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

Output of above code:

array([[[1, 2],
        [2, 3]],

       [[4, 5],
        [5, 6]],

       [[7, 8],
        [8, 9]]])

Converting a list of lists to an array this way normally creates a 2D array, but NumPy treats the tuples as an added dimension, so the output has shape (3, 2, 2), when in fact I want, and was expecting, a shape of (3, 2).

The output I want is:

array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]])

which is of shape (3, 2).

Why does this happen? How can I achieve the array in the form/shape that I want?

quanty
    Why would you even want such an array? Why not just keep a list? If you look at the docs: "Input data, in any form that can be converted to an array. This includes lists, lists of tuples, tuples, tuples of tuples, tuples of lists and ndarrays." So `list` and `tuple` objects always try to be interpreted as dimensions in an array – juanpa.arrivillaga Mar 11 '18 at 22:48
  • Why wouldn't it happen? `numpy` doesn't special-case tuples when converting nested sequences into multi-dimensional arrays (unlike strings, say). – jonrsharpe Mar 11 '18 at 22:50
  • If you do force it to be an array of 2-tuples, you're not getting much advantage out of numpy, because it has to treat each element as "general slow Python object". If you don't want a 3x2x2 array of ints, maybe you at least want a 3x2 array of structs-of-two-ints instead? – abarnert Mar 11 '18 at 22:52
  • Any solution to this would necessarily produce an array with `dtype=object` - which kills all of numpy's ability to do bulk operations on numeric values. You might as well just keep the data as ordinary Python lists-of-lists – jasonharper Mar 11 '18 at 22:52
  • You *could* use a structured datatype, something like `bigram_type = np.dtype([('x','u4'),('y','u4')])`, not sure how useful this would be. I would just stick with `list` objects – juanpa.arrivillaga Mar 11 '18 at 22:56
  • I want such an array so that I can add it to my existing array of other feature types for a sentiment classifier. I have other features in an array, and I want to concatenate that array with an array of bigrams for each sentence so that I can use both sets of features. Is that a nonsensical thing to do? – quanty Mar 11 '18 at 23:47
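The structured dtype suggested in the comments can be sketched like this (a minimal example using juanpa's `bigram_type`; the field names `x` and `y` come from that comment):

```python
import numpy as np

# Structured dtype from the comment above: two unsigned 32-bit integer fields
bigram_type = np.dtype([('x', 'u4'), ('y', 'u4')])
bi_grams = [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

# With a compound dtype, each tuple becomes a record, not a third axis
arr = np.array(bi_grams, dtype=bigram_type)
print(arr.shape)   # (3, 2)
print(arr['x'])    # per-field access to the first element of every bigram
```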

2 Answers


To np.array, your list of lists of tuples isn't any different from a list of lists of lists. It's iterables all the way down. np.array tries to create as high a dimensional array as possible. In this case that is 3d.
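For example (a quick check, not part of the original answer), swapping the tuples for lists produces an identical array:

```python
import numpy as np

# Tuples and lists are interchangeable during dimension discovery
from_tuples = np.array([[(1, 2), (2, 3)], [(4, 5), (5, 6)]])
from_lists = np.array([[[1, 2], [2, 3]], [[4, 5], [5, 6]]])

print(from_tuples.shape)                        # (2, 2, 2) - tuples became an axis
print(np.array_equal(from_tuples, from_lists))  # True
```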

There are ways of sidestepping that and making a 2d array that contains objects, where those objects are things like tuples. But as noted in the comments, why would you want that?

In a recent SO question, I came up with this way of turning an n-d array into an object array of (n-m)-d shape:

In [267]: res = np.empty((3,2),object)
In [268]: arr = np.array(alist)
In [269]: for ij in np.ndindex(res.shape):
     ...:     res[ij] = arr[ij]
     ...:     
In [270]: res
Out[270]: 
array([[array([1, 2]), array([2, 3])],
       [array([4, 5]), array([5, 6])],
       [array([7, 8]), array([8, 9])]], dtype=object)

But that's a 2d array of arrays, not of tuples.

In [271]: for ij in np.ndindex(res.shape):
     ...:     res[ij] = tuple(arr[ij].tolist())
     ...:     
     ...:     
In [272]: res
Out[272]: 
array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]], dtype=object)

That's better (or is it?)

Or I could index the nested list directly:

In [274]: for i,j in np.ndindex(res.shape):
     ...:     res[i,j] = alist[i][j]
     ...:     
In [275]: res
Out[275]: 
array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]], dtype=object)

I'm using ndindex to generate all the indices of a (3,2) array.

The structured array mentioned in the comments works because for a compound dtype, tuples are distinct from lists.

In [277]: np.array(alist, 'i,i')
Out[277]: 
array([[(1, 2), (2, 3)],
       [(4, 5), (5, 6)],
       [(7, 8), (8, 9)]], dtype=[('f0', '<i4'), ('f1', '<i4')])

Technically, though, that isn't an array of tuples. It just represents the elements (or records) of the array as tuples.

In the object dtype array, the elements of the array are references to the tuples in the list (at least in the Out[275] case). In the structured array case the numbers are stored the same way as in a 3d array: as bytes in the array's data buffer.
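A quick way to check both claims (a sketch, not from the original answer; `alist` here stands for the question's nested list):

```python
import numpy as np

alist = [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

# Object array: each cell holds a reference to the original tuple
res = np.empty((3, 2), dtype=object)
for i, j in np.ndindex(res.shape):
    res[i, j] = alist[i][j]
print(res[0, 0] is alist[0][0])  # True - the very same Python object

# Structured array: each cell is a record backed by the array's own buffer
rec = np.array(alist, 'i,i')
print(type(rec[0, 0]))  # numpy.void - a record, not a Python tuple
```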

hpaulj
  • Actually, unless the rhs is a max depth array one can simply preallocate and assign `obj_array[...] = nested_thing` seems to work just fine. – Paul Panzer Mar 12 '18 at 00:26

Here are two more methods to complement @hpaulj's answer. One of them, the frompyfunc method, seems to scale a bit better than the other methods, although hpaulj's preallocation method is also not bad if we get rid of the loop. See timings below:

import numpy as np
import itertools

bi_grams = [[(1, 2), (2, 3)], [(4, 5), (5, 6)], [(7, 8), (8, 9)]]

def f_pp_1(bi_grams):
    # Fill a preallocated object array by pulling tuples one at a time
    # from a flat iterator; frompyfunc calls __next__ once per cell.
    flat = itertools.chain.from_iterable(bi_grams)
    shape = (len(bi_grams), len(bi_grams[0]))
    return np.frompyfunc(flat.__next__, 0, 1)(np.empty(shape, dtype=object))

def f_pp_2(bi_grams):
    # Preallocate an object array, then bulk-assign the nested list;
    # each cell keeps its tuple intact.
    res = np.empty((len(bi_grams), len(bi_grams[0])), dtype=object)
    res[...] = bi_grams
    return res

def f_hpaulj(bi_grams):
    # Preallocate and fill cell by cell with an explicit index loop.
    res = np.empty((len(bi_grams), len(bi_grams[0])), dtype=object)
    for i, j in np.ndindex(res.shape):
        res[i, j] = bi_grams[i][j]
    return res

print(np.all(f_pp_1(bi_grams) == f_pp_2(bi_grams)))
print(np.all(f_pp_1(bi_grams) == f_hpaulj(bi_grams)))

from timeit import timeit
kwds = dict(globals=globals(), number=1000)

print(timeit('f_pp_1(bi_grams)', **kwds))
print(timeit('f_pp_2(bi_grams)', **kwds))
print(timeit('f_hpaulj(bi_grams)', **kwds))

big = 10000 * bi_grams

print(timeit('f_pp_1(big)', **kwds))
print(timeit('f_pp_2(big)', **kwds))
print(timeit('f_hpaulj(big)', **kwds))

Sample output:

True                      <- same result for
True                      <- different methods
0.004281356999854324      <- frompyfunc          small input
0.002839841999957571      <- prealloc ellipsis   small input
0.02361366100012674       <- prealloc loop       small input
2.153144505               <- frompyfunc          large input
5.152567720999741         <- prealloc ellipsis   large input
33.13142323599959         <- prealloc loop        large input
Paul Panzer