Question

Please help understand the cause of the problem in the code below, and suggest related articles to look into.

Background

In my understanding, a numpy structured type with multiple fields, one of which is a sub-array, is defined as:

the_type = np.dtype(
  [                                        # ndarray
    (<name>, <numpy dtype>, <numpy shape>) # (name, dtype, shape)
  ]
)
np.shape([[1, 2]])  # 2D matrix shape (1, 2) with 1 row x 2 columns
np.shape([1])       # 1D array  shape (1, )
np.shape(1)         # 0D array  shape () which is not a scalar

Subarray data type: a structured data type may contain an ndarray with its own dtype and shape:

dt = np.dtype([('a', np.int32), ('b', np.float32, (3,))])
np.zeros(3, dtype=dt)
---
array([(0, [0., 0., 0.]), (0, [0., 0., 0.]), (0, [0., 0., 0.])],
      dtype=[('a', '<i4'), ('b', '<f4', (3,))])

Problem

The first code works with a warning, which I believe is complaining that the 1 in ("b", np.ubyte, 1) is not a proper numpy shape and should be written as the 1D array shape (1,). This is not an issue.

color_type = np.dtype([
    ("r", np.ubyte, (1,)),
    ("g", np.ubyte, (1)),     # <--- warning
    ("b", np.ubyte, 1)        # <--- warning
])
---
FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

However, the second code does not work, and I would like to understand why.

  1. According to the warning in the code above, I believe 16 and (16) are both (16,). Is that correct, or does it depend on the dtype?
  2. I think a Unicode string is an array in Python, since "hoge"[3] -> 'e', so why is (16,) an error?
dt = np.dtype(
  [
    ('first', np.unicode_, 16),    # OK and no warning
    ('middle', np.unicode_, (16)), # OK and no warning
    ('last', np.unicode_, (16,)),  # <----- Error 
    ('grades', np.float64, (2,))   # OK and no warning
  ]
)
x = np.array(
    [
        ('Sarah', 'Jeanette', 'Conner', (8.0, 7.0)), 
        ('John', '', 'Conner', (6.0, 7.0))
    ], 
    dtype=dt
)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-382-3e8049d5246c> in <module>
----> 1 dt = np.dtype(
      2   [
      3     ('first', np.unicode_, 16),
      4     ('middle', np.unicode_, (16)),
      5     ('last', np.unicode_, (16,)),

ValueError: invalid itemsize in generic type tuple

Update

I understand now that I had misunderstood the dtype. In this case the third element is not a shape but the string length.
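For reference, a minimal sketch of the corrected definition, assuming the intent was three fixed-length name strings rather than arrays of strings. It uses np.str_ in place of np.unicode_, since the np.unicode_ alias was removed in NumPy 2.0; on older versions either name should work.

```python
import numpy as np

# For the string fields the third element is a character count (an int);
# for the numeric field it is a sub-array shape (a tuple).
dt = np.dtype(
    [
        ("first", np.str_, 16),
        ("middle", np.str_, 16),
        ("last", np.str_, 16),
        ("grades", np.float64, (2,)),
    ]
)
x = np.array(
    [
        ("Sarah", "Jeanette", "Conner", (8.0, 7.0)),
        ("John", "", "Conner", (6.0, 7.0)),
    ],
    dtype=dt,
)

print(x["last"])          # one string per record
print(x["grades"].shape)  # (2, 2): two records, two grades each
```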

mon

4 Answers

  1. According to the warning in the code above, I believe 16 and (16) are both (16,). Is it correct or depends on the dtype?

In Python, 16 is an integer literal, and (16) is a parenthesized expression that evaluates to the value 16. (Remember that when you surround an expression with parentheses, you do it to control the order of evaluation of operators, not to convert the expression into a tuple. For example, in the expression (2 + 3)/2, the parentheses around 2 + 3 do not produce a tuple; they only ensure that the + operator is evaluated before the / operator.)

In Python, (16,) is definitely a tuple. It is therefore not equivalent to 16 or (16).

  2. I think a Unicode string is an array in Python as "hoge"[3] -> 'e'

No, in Python, a Unicode string is not an array. The fact that you are able to perform the indexing operation [] on a unicode string doesn't necessarily make it an array. For that matter, you can perform the [] operation on a dict too, and dicts are not arrays either.

then why (16,) is an error?

In numpy, when you specify a field to be a unicode string, numpy needs to know how many unicode characters that string will hold (numpy only supports fixed-length strings as fields of a custom dtype). In other words, you need to tell numpy the length of the unicode string. And that, of course, must be a simple integer 16, rather than a tuple (16,).
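As a small illustration of "length, not shape": with an unsized string type, the integer becomes part of the dtype itself. This is only a sketch; it uses np.str_ (the current name for np.unicode_) and borrows the field name from the question.

```python
import numpy as np

# An unsized string type plus an integer gives a fixed-length string dtype.
by_number = np.dtype([("first", np.str_, 16)])
by_code = np.dtype([("first", "U16")])

print(by_number == by_code)         # the two spellings describe the same field
print(by_number["first"].itemsize)  # 64 (16 characters, 4 bytes each)
print(by_number["first"].shape)     # (): the field is not a sub-array
```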

BTW, if you don't specify the length of the unicode string field, there won't be any error, as numpy will assume that the field is a zero-length unicode string; you will get an error at the time of assigning values to the string field.

fountainhead

As the error is indicating, the third position in ('first', np.unicode_, 16) is interpreted as the size for the type of the tuple element. So, first is defined as a size 16 unicode field.

('middle', np.unicode_, (16)) also works, since (16) just evaluates to 16, the parentheses are superfluous. So, middle will be just like first.

However, ('last', np.unicode_, (16,)) causes an error, because you're passing a tuple where np.dtype expects the itemsize of the field's type. (16,) can only be understood as a tuple and is not automatically converted to a scalar, while np.dtype expects a scalar as the itemsize for an np.unicode_ field.

If your aim was to define a field that takes an array of sixteen unicode values, of some length (say, 10), you'd use:

dt = np.dtype(
  [
    ('first', np.unicode_, 16),    
    ('middle', np.unicode_, (16)), 
    ('last', 'U10', (16,)),  
    ('grades', np.float64, (2,))   
  ]
)

And then you could define an array like:

a = np.array([('x','y',
               ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'],
               [1.0, 2.0])], dt)

a would then be defined as:

array([('x', 'y', ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'], [1., 2.])],
      dtype=[('first', '<U16'), ('middle', '<U16'), ('last', '<U10', (16,)), ('grades', '<f8', (2,))])

A simpler definition of dt with the same result as above:

dt = np.dtype(
  [
    ('first', 'U16'),    
    ('middle', 'U16'), 
    ('last', 'U10', (16,)),  
    ('grades', np.float64, (2,))   
  ]
)
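To see which fields carry a sub-array shape, one can inspect an array built from this dtype. A minimal sketch, assuming three zero-filled records purely for illustration:

```python
import numpy as np

dt = np.dtype(
    [
        ("first", "U16"),
        ("middle", "U16"),
        ("last", "U10", (16,)),
        ("grades", np.float64, (2,)),
    ]
)
a = np.zeros(3, dtype=dt)  # three empty records

print(a["first"].shape)   # (3,): one string per record
print(a["last"].shape)    # (3, 16): sixteen strings per record
print(a["grades"].shape)  # (3, 2): two floats per record

# The field's dtype records both the element type and the sub-array shape.
print(dt["last"].base, dt["last"].shape)
```

The string length (10 or 16 characters) lives in the element dtype, while the tuple only adds dimensions.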
Grismar
  • Thanks for the answer but still not clear as my understanding of structured type definition is from the doc saying "A structured data type may contain a ndarray with its own dtype and shape", hence (16) part is supposed to be numpy shape. Which I believe is (n,) format? Or 16 is not defining a numpy shape? – mon Dec 21 '20 at 03:00
  • Am I correct in thinking you want `'last'` to be an array of 16 unicode text fields? What size unicode strings? – Grismar Dec 21 '20 at 04:25
  • In 'last' and 'grades' the tuple is a shape. The others are just alternate ways of specifying 'U16'. – hpaulj Dec 21 '20 at 06:59
  • Agreed @hpaulj - I left them unchanged from the originally provided code by OP, but explained exactly what you're saying in the answer. But since you didn't take that away from the provided solution, I'll add one that's less convoluted. – Grismar Dec 21 '20 at 07:05
  • My comment was meant more for the OP, who still seems a bit confused by the different uses. – hpaulj Dec 21 '20 at 07:32

string length is not the same as field shape

Here's an array with 2 fields, one with a string dtype, the other numeric:

In [148]: np.array([('abc',2),('defg',5)], dtype=[('x','U10'),('y',int)] )
Out[148]: array([('abc', 2), ('defg', 5)], dtype=[('x', '<U10'), ('y', '<i8')])
In [149]: _.shape
Out[149]: (2,)
In [150]: __['x']
Out[150]: array(['abc', 'defg'], dtype='<U10')

Note that I specify a unicode string length, 'U10' (10 char).

I can also specify the string length with a separate number. That's what you are doing with np.unicode_, 16. The resulting dtype is the same.

In [151]: np.array([('abc',2),('defg',5)], dtype=[('x','U',10),('y',int)] )
Out[151]: array([('abc', 2), ('defg', 5)], dtype=[('x', '<U10'), ('y', '<i8')])

But if I provide a number after the numeric dtype, I get a new dimension. That's the (<name>, <numpy dtype>, <numpy shape>) specification:

In [152]: np.array([('abc',[2,3]),('defg',[5,4])], dtype=[('x','U',10),('y',int,2)] )
Out[152]: 
array([('abc', [2, 3]), ('defg', [5, 4])],
      dtype=[('x', '<U10'), ('y', '<i8', (2,))])
In [153]: _['y']              # shape (2,2)
Out[153]: 
array([[2, 3],
       [5, 4]])

I could define the string field to have a dimension:

In [155]: np.array([(['abc','xuz'],),(['defg','foo'],)], dtype=[('x','U10',2)] ) 
Out[155]: array([(['abc', 'xuz'],), (['defg', 'foo'],)], dtype=[('x', '<U10', (2,))])
In [156]: _['x']
Out[156]: 
array([['abc', 'xuz'],
       ['defg', 'foo']], dtype='<U10')

Here again the shape is (2,2).

The third tuple element has a different function in these two expressions: ('x','U',10) and ('x','U10',2)

Usually I use 'U10', so haven't encountered the 'U',10 case before. I could combine the two with:

In [158]: np.array([(['abc','xuz'],),(['defg','foo'],)], dtype=[('x',('U',10),2)] )
Out[158]: array([(['abc', 'xuz'],), (['defg', 'foo'],)], dtype=[('x', '<U10', (2,))])

That's the same as [155].

So that should explain why ('x','U',(10,)) does not work; the 10 here is a string length, as in 'U10', not a shape.
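The two cases can be compared side by side. In this sketch, try_dtype is a hypothetical helper written for the comparison, not a numpy function, and the exact exception type and message may vary between numpy versions:

```python
import numpy as np

def try_dtype(spec):
    """Return the dtype built from spec, or the error text if numpy rejects it."""
    try:
        return np.dtype(spec)
    except (TypeError, ValueError) as exc:
        return f"{type(exc).__name__}: {exc}"

# 'U' has no length yet, so the third element must be an int length.
length_as_tuple = try_dtype([("x", "U", (10,))])
# 'U10' already has a length, so the third element is a field shape.
shape_as_tuple = try_dtype([("x", "U10", (10,))])

print(length_as_tuple)
print(shape_as_tuple)
```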

Another example

One 'U10' string per field:

In [166]: np.zeros((1,), dtype=[('x','U',10)])
Out[166]: array([('',)], dtype=[('x', '<U10')])

same:

In [167]: np.zeros((1,), dtype=[('x','U10')])
Out[167]: array([('',)], dtype=[('x', '<U10')])

10 'U1' strings per field:

In [168]: np.zeros((1,), dtype=[('x','U1',10)])
Out[168]: 
array([(['', '', '', '', '', '', '', '', '', ''],)],
      dtype=[('x', '<U1', (10,))])

The field shape can be multidimensional:

In [169]: np.zeros((1,), dtype=[('x','U1',(2,3))])
Out[169]: array([([['', '', ''], ['', '', '']],)], dtype=[('x', '<U1', (2, 3))])
In [170]: _['x']
Out[170]: 
array([[['', '', ''],
        ['', '', '']]], dtype='<U1')
In [171]: _.shape
Out[171]: (1, 2, 3)

A tuple is fine when specifying the field shape, but not when specifying the string length. If you want the third tuple element to be a field shape, specify 'U10', not 'U' or 'unicode'.

the future warning

The warning is a different matter:

In [175]: np.zeros((1,), dtype=[('x','U10')])
Out[175]: array([('',)], dtype=[('x', '<U10')])
In [176]: _['x'].shape
Out[176]: (1,)

Up to now this is the same thing, with the '1' making no difference:

In [177]: np.zeros((1,), dtype=[('x','U10',1)])
<ipython-input-177-932c79fbeaf4>:1: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np.zeros((1,), dtype=[('x','U10',1)])
Out[177]: array([('',)], dtype=[('x', '<U10')])
In [178]: _['x'].shape
Out[178]: (1,)

But the numpy developers are tightening up rough edges, so that in the future it will behave like:

In [179]: np.zeros((1,), dtype=[('x','U10',(1,))])
Out[179]: array([([''],)], dtype=[('x', '<U10', (1,))])
In [180]: _['x'].shape
Out[180]: (1, 1)

That will make it consistent with other uses of the field shape:

In [183]: np.zeros((1,), dtype=[('x','U10',2)])['x'].shape
Out[183]: (1, 2)
In [184]: np.zeros((1,), dtype=[('x','U10',(2,))])['x'].shape
Out[184]: (1, 2)
hpaulj

You wrote,

"I believe 16 and (16) are both (16,)"

That is wrong.

In Python, there is a built-in container type named tuple.

The following code provides an example of a tuple

x = (1, 56, 7, 9, 13)

The following code does NOT create a tuple containing the number 16.

y = (16)
print(type(y))
# <class 'int'>

The reason for this is very simple: if parentheses wrap only one object, then the parentheses denote mathematical order of operations, not a tuple

x = (1 + 5) * 9  
x = (  6  ) * 9
x =    6    * 9
x = 54

So... (16) is not a tuple.

  • 16 is an integer. 16 is NOT a tuple
  • (16) is also an integer. (16) is NOT a tuple
  • (16,) is a tuple.
print((16) == 16)
# prints `True`

Ignore everything except container and the for-loop in the following code. Try looping over (5). The result is very different than having a for-loop over (1, 2, 3).

import io

string_stream = io.StringIO()

try:
    container = (1, 65, 8, 3, 3, 9)
    container = (5) # try commenting-out this line
    ####################################
    # WILL THE FOLLOWING FOR-LOOP WORK!??
    ####################################
    for elem in iter(container):
        print(
            elem,
            file = string_stream
        )
    print(
        "THE FOR-LOOP SUCCESSFULLY EXECUTED!",
        file = string_stream
    )
except TypeError as tipe_ehror:
   print(
       container,
       type(tipe_ehror),
       tipe_ehror,
       sep = "\n",
       file = string_stream
   )
finally:
    print(string_stream.getvalue())

(5) is NOT a container.
(5) has parentheses denoting order-of-operations for some math.

Note that the (1 + 2) may look like more than one thing inside of parentheses.

(1 + 2) is actually only ONE object inside of parentheses.

The expression 1 + 2 evaluates to the value returned by the function call int.__add__(1, 2)

Even if it looks like a lot of stuff, everything inside grouping parentheses eventually collapses to a single value.

x = ((1 + 2) * 7) 
x = ((3) * 7)
x = (3 * 7)
x = (21)
x = 21

x = int.__mul__(int.__add__(1, 2), 7)    
Toothpick Anemone