
I have an array like this:

array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('21801', 1.9849650921836601e-31, 0.99999999997999001), ...,
       ('45164194', 1.0413482803123399e-24, 0.99999999997453404),
       ('45164198', 1.09470356446595e-24, 0.99999999997635303),
       ('45164519', 3.7521365799080699e-24, 0.99999999997453404)], 
      dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])

And I want to turn it into this (adding the prefix '2R:' to each value in the first column):

array([('2R:6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('2R:6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('2R:21801', 1.9849650921836601e-31, 0.99999999997999001), ...,
       ('2R:45164194', 1.0413482803123399e-24, 0.99999999997453404),
       ('2R:45164198', 1.09470356446595e-24, 0.99999999997635303),
       ('2R:45164519', 3.7521365799080699e-24, 0.99999999997453404)], 
      dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])

I looked into nditer, but I want to support earlier versions of numpy. I have also read that one should avoid explicit iteration.

Greg
  • I was thinking I could reference just the first column and apply a function on that, but I'm getting `index error: invalid index` when I try this: `array[:,0]` – Greg May 06 '14 at 14:42
  • With a `np.rec.array` you would be able to access those columns using `array["pos"]`. But I don't know how to add anything in the "string addition broadcasting" manner you are looking for. – eickenberg May 06 '14 at 14:45
  • hmm, I can access the first column with array['pos'] but I'm not sure how to modify the values from there. (assuming that's the right direction) – Greg May 06 '14 at 14:49
  • possible duplicate of [Element-wise string concatenation in numpy](http://stackoverflow.com/questions/9958506/element-wise-string-concatenation-in-numpy) – Saullo G. P. Castro May 06 '14 at 14:56
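
On the `array[:,0]` attempt in the comments above: a structured array like the one in the question is one-dimensional, with one record per element, so a field is selected by name rather than by a second index. A minimal sketch, assuming a shortened two-row version of the array (the name `arr` and the bytes literals are illustrative choices, not from the question):

```python
import numpy as np

# Two rows from the question; each element of the array is one whole record.
arr = np.array([(b'6506', 4.6725971801473496e-25, 0.99999999995088695),
                (b'6601', 2.2452745388799898e-27, 0.99999999995270605)],
               dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])

print(arr.shape)        # a single axis, so arr[:, 0] asks for one index too many
print(arr['pos'])       # the whole 'pos' field, selected by name
print(arr[0]['par1'])   # one field of the first record

try:
    arr[:, 0]
except IndexError as exc:
    print('IndexError:', exc)
```

The field view returned by `arr['pos']` can also be assigned to, which is what the answers below rely on.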

3 Answers

Using numpy.core.defchararray.add:

>>> from numpy import array
>>> from numpy.core.defchararray import add
>>>
>>> xs = array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
...             ('6601', 2.2452745388799898e-27, 0.99999999995270605),
...             ('21801', 1.9849650921836601e-31, 0.99999999997999001),
...             ('45164194', 1.0413482803123399e-24, 0.99999999997453404),
...             ('45164198', 1.09470356446595e-24, 0.99999999997635303),
...             ('45164519', 3.7521365799080699e-24, 0.99999999997453404)],
...            dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])
>>> xs['pos'] = add('2R:', xs['pos'])
>>> xs
array([('2R:6506', 4.67259718014735e-25, 0.999999999950887),
       ('2R:6601', 2.24527453887999e-27, 0.999999999952706),
       ('2R:21801', 1.98496509218366e-31, 0.99999999997999),
       ('2R:45164194', 1.04134828031234e-24, 0.999999999974534),
       ('2R:45164198', 1.09470356446595e-24, 0.999999999976353),
       ('2R:45164519', 3.75213657990807e-24, 0.999999999974534)],
      dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])

UPDATE: You can use numpy.char.add instead of numpy.core.defchararray.add (as pointed out in a comment by @joel-buursma):

>>> import numpy
>>> numpy.char == numpy.core.defchararray
True
falsetru
  • Impressive solution. If we were to append (instead of prepend) `'2R:'`, would this function handle its task significantly differently according to whether 'S100' leaves enough space or not? (Imagine for example that the number has 98 digits.) – eickenberg May 06 '14 at 14:57
  • @eickenberg, According to an experiment (http://pastebin.com/dVRyJmQH), `add` (both prepend/append) truncate trailing parts according to the size specified. – falsetru May 06 '14 at 15:02
  • You can now use `np.char` instead of `numpy.core.defchararray`. (https://numpy.org/doc/stable/reference/generated/numpy.char.add.html) – Joel Buursma Oct 27 '22 at 19:17
  • @JoelBuursma, Thank you for the information. I updated the answer according to your comment. – falsetru Oct 28 '22 at 06:59
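
For readers on Python 3, where an 'S' field holds bytes, here is a minimal sketch of the same approach with a bytes prefix, plus a small demonstration of the truncation discussed in the comments. It assumes a shortened two-row array; the `narrow` and `wide` arrays and the 'S5' width are hypothetical, chosen only to make the truncation visible:

```python
import numpy as np

xs = np.array([(b'6506', 4.6725971801473496e-25, 0.99999999995088695),
               (b'45164519', 3.7521365799080699e-24, 0.99999999997453404)],
              dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])

# Use a bytes prefix so that it matches the bytes stored in the 'S100' field.
xs['pos'] = np.char.add(b'2R:', xs['pos'])
print(xs['pos'])

# Hypothetical narrow field: 'S5' keeps at most 5 bytes per item.
narrow = np.array([b'6506', b'21801'], dtype='S5')
wide = np.char.add(b'2R:', narrow)   # the result itself holds the full strings
narrow[:] = wide                     # writing back into 'S5' drops the trailing bytes
print(wide)
print(narrow)
```

In this sketch the loss happens when the result is written back into the fixed-width field, so a field that might overflow needs a wider dtype before the assignment.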

A simple (albeit perhaps not optimal) solution is just:

import numpy as np

a = np.array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
       ('6601', 2.2452745388799898e-27, 0.99999999995270605),
       ('21801', 1.9849650921836601e-31, 0.99999999997999001),
       ('45164194', 1.0413482803123399e-24, 0.99999999997453404),
       ('45164198', 1.09470356446595e-24, 0.99999999997635303),
       ('45164519', 3.7521365799080699e-24, 0.99999999997453404)],
      dtype=[('pos', '|S100'), ('par1', '<f8'), ('par2', '<f8')])


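# Written for Python 2. On Python 3 an 'S' field holds bytes, so use b''.join((b'2R:', x)) instead.
a['pos'] = [''.join(('2R:',x)) for x in a['pos']]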

In [11]: a
Out[11]:
array([('2R:6506', 4.67259718014735e-25, 0.999999999950887),
       ('2R:6601', 2.24527453887999e-27, 0.999999999952706),
       ('2R:21801', 1.98496509218366e-31, 0.99999999997999),
       ('2R:45164194', 1.04134828031234e-24, 0.999999999974534),
       ('2R:45164198', 1.09470356446595e-24, 0.999999999976353),
       ('2R:45164519', 3.75213657990807e-24, 0.999999999974534)],
      dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])

While I like @falsetru's answer for using core numpy routines, the list comprehension surprisingly seems a bit faster:

In [19]: a = np.empty(20000, dtype=[('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')])

In [20]: %timeit a['pos'] = [''.join(('2R:',x)) for x in a['pos']]
100 loops, best of 3: 11.1 ms per loop

In [21]: %timeit a['pos'] = add('2R:', a['pos'])
100 loops, best of 3: 15.7 ms per loop

Definitely benchmark on your own data and hardware to see which makes more sense for your application, though. One thing I've learned is that, depending on the task at hand, basic Python constructs can outperform numpy built-ins.
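
For benchmarking outside IPython, a rough sketch with the standard `timeit` module could look like the following. The helper names (`make_array`, `prefix_with_add`, `prefix_with_comprehension`) are made up for illustration, the prefix is written as bytes so that it runs on Python 3, and the array is filled with generated positions rather than real data:

```python
import timeit
import numpy as np

DTYPE = [('pos', 'S100'), ('par1', '<f8'), ('par2', '<f8')]

def make_array(n):
    """Build a test array whose 'pos' field holds the numbers 0..n-1 as bytes."""
    a = np.zeros(n, dtype=DTYPE)
    a['pos'] = [str(i).encode('ascii') for i in range(n)]
    return a

def prefix_with_add(a):
    out = a.copy()                       # copy so repeated runs start from the same input
    out['pos'] = np.char.add(b'2R:', out['pos'])
    return out

def prefix_with_comprehension(a):
    out = a.copy()
    out['pos'] = [b'2R:' + x for x in out['pos']]
    return out

a = make_array(20000)
for func in (prefix_with_add, prefix_with_comprehension):
    # Best of 3 repeats, 10 calls each; the copy is included in both timings.
    seconds = min(timeit.repeat(lambda: func(a), number=10, repeat=3))
    print(func.__name__, seconds / 10, 'seconds per call')
```

The numbers will vary with the NumPy version, the field width, and the string lengths, so they should be read as a comparison on one machine rather than a general result.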

JoshAdel
  • Interesting results in timing it. Thanks! – Greg May 06 '14 at 15:07
  • both the solutions throw errors for me: `join` : `TypeError: sequence item 1: expected str instance, numpy.bytes_ found` and `add`: `TypeError: must be str, not numpy.bytes_` – Gaurav Singhal Sep 18 '18 at 10:48

Another slightly faster solution is a list comprehension with the + operator. I do not understand why it is faster, but it is simple and elegant.

a['pos'] = ["2R:" + x for x in a['pos']]

Timings:

%timeit a['pos'] = ["2R:" + x for x in a['pos']]
8.07 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit a['pos'] = [''.join(('2R:',x)) for x in a['pos']]
9.53 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit a['pos'] = add('2R:', a['pos'])
14.2 ms ± 337 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

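PS: I created the array a using a slightly different definition:

a = np.empty(20000, dtype=[('pos', 'U5'), ('par1', '<f8'), ('par2', '<f8')])

because if I use an Sxxx type for pos, the concatenation raises a TypeError for me (presumably because on Python 3 an S field holds bytes, which cannot be concatenated with a str prefix).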
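
One caveat with a narrow unicode field on real data, sketched below under the assumption of a two-row array taken from the question: a 'U' field accepts the str prefix directly, but it has to be wide enough for the prefixed values. The 'U100' width and the `short` array are illustrative choices:

```python
import numpy as np

a = np.array([('6506', 4.6725971801473496e-25, 0.99999999995088695),
              ('45164519', 3.7521365799080699e-24, 0.99999999997453404)],
             dtype=[('pos', 'U100'), ('par1', '<f8'), ('par2', '<f8')])

# With a unicode field, a plain str prefix concatenates without a TypeError.
a['pos'] = ['2R:' + x for x in a['pos']]
print(a['pos'])

# Hypothetical 'U5' field: only the first 5 characters survive the assignment.
short = np.zeros(1, dtype='U5')
short[:] = ['2R:' + '6506']
print(short)
```

For the timing run above the width does not matter, since the benchmark array holds no real positions, but actual data would be cut off by 'U5'.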

Gaurav Singhal