
I have an array of floats that I have normalised to one (i.e. the largest number in the array is 1), and I wanted to use it as colour indices for a graph. To use grayscale in matplotlib, the colour values must be strings representing numbers between 0 and 1, so I wanted to convert the array of floats to an array of strings. I was attempting to do this with astype('str'), but this appears to create some values that are not the same as (or even close to) the originals.

I notice this because matplotlib complains about finding the number 8 in the array, which is odd as it was normalised to one!

In short, I have an array phis, of float64, such that:

numpy.where(phis.astype('str').astype('float64') != phis)

is non-empty. This is puzzling, as (hopefully naively) it looks like a bug in numpy. Is there anything I could have done wrong to cause this?

Edit: after investigation this appears to be due to the way the string function handles high-precision floats. Using a vectorized toString function (as in robbles' answer), this is also the case; however, if the lambda function is:

lambda x: "%.2f" % x

then the graphing works - curiouser and curiouser. (Obviously the arrays are no longer equal, however!)
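
For concreteness, a minimal sketch of the conversion that made the graph work (the array values here are made up for illustration):

import numpy

phis = numpy.array([0.1234567890123456, 1.0, 1e-30])  # hypothetical data

# Vectorised fixed-point formatting, as described above; the round-trip
# comparison from the question then differs only because the values were
# deliberately rounded to two decimal places.
tostring = numpy.vectorize(lambda x: "%.2f" % x)
print(tostring(phis))  # ['0.12' '1.00' '0.00']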

V.S.

5 Answers


You seem a bit confused as to how numpy arrays work behind the scenes. Each item in an array must be the same size.

The string representation of a float doesn't work this way. For example, repr(1.3) yields '1.3', but repr(1.33) yields '1.3300000000000001'.

An accurate string representation of a floating-point number therefore produces a variable-length string.

Because numpy arrays consist of elements that are all the same size, numpy requires you to specify the length of the strings within the array when you're using string arrays.

If you use x.astype('str'), it will always convert things to an array of strings of length 1.

For example, using x = np.array(1.344566), x.astype('str') yields '1'!

You need to be more explicit and use the '|Sx' dtype syntax, where x is the length of the string for each element of the array.

For example, use x.astype('|S10') to convert the array to strings of length 10.
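For illustration, a small sketch of both behaviours (the exact output of astype('str') varies across numpy versions, so '|S1' is used here to make the truncation explicit; outputs are shown as Python 3 byte strings):

import numpy as np

x = np.array([1.344566, 0.000001233])

# Length-1 strings: everything past the first character is silently lost.
print(x.astype('|S1'))   # [b'1' b'0']

# Length-10 strings keep up to ten characters of each value's repr;
# anything longer is still silently truncated, which is exactly the
# failure mode in the question.
print(x.astype('|S10'))  # [b'1.344566' b'1.233e-06']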

Even better, just avoid using numpy arrays of strings altogether. It's usually a bad idea, and there's no reason I can see from your description of your problem to use them in the first place...

Joe Kington
  • The reasoning for using numpy arrays of strings was because matplotlib requires a correctly shaped iterable of strings which represent numbers between 0 and 1 in order to represent grayscale, (which at the time I wanted). It seemed easiest to convert the array of numbers that I had to an array of strings. I wasn't anticipating the length complication. – V.S. Mar 23 '11 at 10:13
  • helpful also in this situation: 1.) read data from file 2.) assume all entries are `float`, however, some are `nan`. 3.) if all are read as float, there will be `double64` variables in the list which show up as `nan` but aren't recognized as `numpy.nan` 4.) in order to replace those, I successfully used: `if V[-1].astype('|S3') == 'nan': V[-1] = numpy.nan` – Schorsch Mar 21 '14 at 15:25
  • you can use np.genfromtxt and deal with this (more or less) automatically. It is always a bad idea to convert floats to strings if you intend to use them as float. – Vincenzooo May 16 '16 at 17:10
  • I know this is ~7 years old, but I'm commenting because this is no longer the case (python 3.6; np 1.14.0) – Mohammad Athar Feb 13 '18 at 17:17

If you have an array of numbers and you want an array of strings, you can write:

strings = ["%.2f" % number for number in numbers]

If your numbers are floats, the result is a list of the same numbers as strings, each with two decimal places.

>>> a = [1,2,3,4,5]
>>> min_a, max_a = min(a), max(a)
>>> a_normalized = [float(x-min_a)/(max_a-min_a) for x in a]
>>> a_normalized
[0.0, 0.25, 0.5, 0.75, 1.0]
>>> a_strings = ["%.2f" % x for x in a_normalized]
>>> a_strings
['0.00', '0.25', '0.50', '0.75', '1.00']

Notice that it also works with numpy arrays:

>>> a = numpy.array([0.0, 0.25, 0.5, 0.75, 1.0])
>>> print ["%.2f" % x for x in a]
['0.00', '0.25', '0.50', '0.75', '1.00']

A similar methodology can be used if you have a multi-dimensional array:

new_array = numpy.array(["%.2f" % x for x in old_array.reshape(old_array.size)])
new_array = new_array.reshape(old_array.shape)

Example:

>>> x = numpy.array([[0,0.1,0.2],[0.3,0.4,0.5],[0.6, 0.7, 0.8]])
>>> y = numpy.array(["%.2f" % w for w in x.reshape(x.size)])
>>> y = y.reshape(x.shape)
>>> print y
[['0.00' '0.10' '0.20']
 ['0.30' '0.40' '0.50']
 ['0.60' '0.70' '0.80']]

If you check the Matplotlib example for the function you are using, you will notice they use a similar methodology: build an empty matrix and fill it with strings built via string interpolation. The relevant part of the referenced code is:

colortuple = ('y', 'b')
colors = np.empty(X.shape, dtype=str)
for y in range(ylen):
    for x in range(xlen):
        colors[x, y] = colortuple[(x + y) % len(colortuple)]

surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, facecolors=colors,
        linewidth=0, antialiased=False)
Escualo
  • That wasn't the question. You're creating a `list`, he wants a numpy array. – Chinmay Kanchi Mar 19 '11 at 23:18
  • My apologies if it was not clear, but I'm dealing with numpy arrays, not python lists. What's more, my array is 2 dimensional, so a 1dim list comprehension wouldn't work. I'm fully aware that I can create an intermediate python list and then convert to a numpy array, but it seems like this method above should work and that it's extra (slow) programming to use an intermediate list. – V.S. Mar 19 '11 at 23:19
  • If an object can be iterated over (like a list or a numpy array) it supports the list comprehension. It does not need to be a list (duck typing) – Escualo Mar 19 '11 at 23:25
  • Yes, but you don't get a numpy array out, do you? – Chinmay Kanchi Mar 19 '11 at 23:31
  • Arrieta: it won't work because the list comprehension will be iterating over numpy.ndarrays, not single numbers, when a multidimensional array is used – robbles Mar 19 '11 at 23:36
  • You can flatten it and then reshape it – Escualo Mar 19 '11 at 23:37
  • plot_surface(X, Y, Z, facecolours = phis.astype('str')) – V.S. Mar 19 '11 at 23:59

This is probably slower than what you want, but you can do:

>>> tostring = numpy.vectorize(lambda x: str(x))
>>> numpy.where(tostring(phis).astype('float64') != phis)
(array([], dtype=int64),)

It looks like it rounds off the values when it converts to str from float64, but this way you can customize the conversion however you like.
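
For instance, a hypothetical variant of the lambda that also keeps very small magnitudes intact (not part of the original answer):

import numpy

# Scientific notation preserves tiny values (e.g. ~1e-30) that a
# fixed-point format such as "%.2f" would flatten to '0.00'.
tostring_sci = numpy.vectorize(lambda x: "%.6e" % x)
print(tostring_sci(numpy.array([1e-30, 0.25])))  # ['1.000000e-30' '2.500000e-01']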

robbles
  • This doesn't work either, which leads me to suggest that the conversion of very small numbers to strings, fails? I.e. the array contains numbers of the order 10^-30. – V.S. Mar 19 '11 at 23:36
  • you mean you get a different result? I tried it just now with a small 2D array and it worked – Maybe it is a bug... – robbles Mar 19 '11 at 23:38
  • Ok, now I see the same thing with really small numbers. Maybe it's a general floating-point math issue? – robbles Mar 19 '11 at 23:44
  • I do get a different result, but perhaps the limitation is not due to the order of magnitude of the number but the degree of precision?(whilst being described in scientific notation). Edit: If it's a floating point issue, what sort of floating point error mistakes a number much less than 1 as one around 8? haha – V.S. Mar 19 '11 at 23:44

If the main problem is the loss of precision when converting from a float to a string, one possible way to go is to convert the floats to Decimal objects: http://docs.python.org/library/decimal.html.

In Python 2.7 and higher you can directly convert a float to a Decimal object.
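
A short sketch of that approach (the values are illustrative):

from decimal import Decimal

x = 0.1
# Decimal(float) exposes the exact binary value the float actually stores...
print(Decimal(x))        # 0.1000000000000000055511151231257827021181583404541015625
# ...while going through repr() keeps the short, round-trippable form.
print(Decimal(repr(x)))  # 0.1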

ev-br

I ran into this problem when my pandas dataframes started having float precision issues that were bleeding into their string representations when doing df.round(2).astype(str).

I ended up going with np.char.mod("%.2f", phis), which uses broadcasting to run "%.2f".__mod__(el) on each element of the dataframe, instead of iterating in Python, which can make a pretty sizeable difference if your dataframes are large enough. Using limited-length strings (as the accepted answer suggests) was a non-starter for me, because keeping the decimals mattered more in my case than an exact number of significant digits.
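
A minimal sketch of that call (the data values are hypothetical):

import numpy as np

phis = np.array([0.12345, 1.0, 1e-30])  # hypothetical data
# Broadcasts the "%.2f" format operation over every element without a
# Python-level loop.
print(np.char.mod("%.2f", phis))  # ['0.12' '1.00' '0.00']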

I would have tried numpy.format_float_positional, which is the function numpy uses for formatting and is supposedly much faster than the printf-equivalent used by Python, but it doesn't work element-wise (or at all) on ndarrays, and manual iteration was the part I was looking to avoid.
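
For reference, the scalar usage looks like this; applying it across an array still needs an explicit Python-level loop, which was the thing being avoided:

import numpy as np

# Scalars only; numpy >= 1.14 is assumed here.
print(np.format_float_positional(0.12345, precision=2))  # 0.12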

There's no ufunc for formatting, so as far as I can tell that's likely to be the most efficient way of doing it.

VLRoyrenn