A collaborator and I noticed something odd in numpy that we don't understand. This occurs using Python 3.5.4 and numpy version 1.14.2-py35ha9ae307_1 (plus an earlier one, which I updated just in case).

The issue seems to be that if I add a float to a numpy array along with some strings, the float gets converted to a string as expected, but sometimes (very rarely) the string gets truncated in a very odd way. I don't know if this is a bug or just some behaviour we don't understand. Either way it seems bizarre. Any insight would be useful.

Reproducible example

import numpy as np
p = np.empty([1,2],dtype='U21')
a = 4.4226657709978134e-05
p[0] = np.array(['string',a])
p

# WTF
Out[5]: array([['string', '4.4226657709978134e-0']], dtype='<U21')

It depends on the final digit of the float too

# Works as expected
In [26]: np.array(['string',4.4226657709978130e-05], dtype='<U21')
Out[26]: array(['string', '4.422665770997813e-05'], dtype='<U21')

# Works as expected
In [27]: np.array(['string',4.4226657709978131e-05], dtype='<U21')
Out[27]: array(['string', '4.422665770997813e-05'], dtype='<U21')

# Doesn't work as expected
In [28]: np.array(['string',4.4226657709978132e-05], dtype='<U21')
Out[28]: array(['string', '4.4226657709978134e-0'], dtype='<U21')

# Doesn't work as expected
In [29]: np.array(['string',4.4226657709978133e-05], dtype='<U21')
Out[29]: array(['string', '4.4226657709978134e-0'], dtype='<U21')

# Doesn't work as expected
In [30]: np.array(['string',4.4226657709978134e-05], dtype='<U21')
Out[30]: array(['string', '4.4226657709978134e-0'], dtype='<U21')

# Doesn't work as expected
In [31]: np.array(['string',4.4226657709978135e-05], dtype='<U21')
Out[31]: array(['string', '4.4226657709978134e-0'], dtype='<U21')

# Doesn't work as expected
In [32]: np.array(['string',4.4226657709978136e-05], dtype='<U21')
Out[32]: array(['string', '4.4226657709978134e-0'], dtype='<U21')

# Doesn't work as expected
In [33]: np.array(['string',4.4226657709978137e-05], dtype='<U21')
Out[33]: array(['string', '4.4226657709978134e-0'], dtype='<U21')

# Works as expected
In [34]: np.array(['string',4.4226657709978138e-05], dtype='<U21')
Out[34]: array(['string', '4.422665770997814e-05'], dtype='<U21')

# Works as expected
In [35]: np.array(['string',4.4226657709978139e-05], dtype='<U21')
Out[35]: array(['string', '4.422665770997814e-05'], dtype='<U21')

The issue is trivial to fix, e.g. by switching to a Pandas data frame that can deal with different types. But the behaviour seems odd. We noticed it only because we were doing this on millions of numbers and the sanity checks highlighted it (all our numbers should be <1, and we very occasionally started getting numbers >1).

roblanf
  • Could it be related to this issue (https://github.com/pandas-dev/pandas/issues/4405) raised on GitHub? – Watty62 May 16 '18 at 10:27

2 Answers

This isn't really anything about Numpy. See https://stackoverflow.com/a/25899600/982257

Python(3) will generally represent floats as strings with the fewest digits necessary to represent that particular float value unambiguously.

Neither 4.4226657709978137e-05 nor 4.4226657709978138e-05 is represented exactly by an IEEE double. For 4.4226657709978137e-05, the shortest unambiguous representation (4.4226657709978134e-05) just happens to be 22 characters, not 21, so when you try to stuff it into a <U21 it gets truncated.

To be safe for any double in scientific notation you want a width of at least 24 characters: a sign, up to 17 significant digits, the decimal point, and an exponent such as e-308.
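The lengths are easy to check without NumPy at all. This is a minimal sketch, assuming Python 3's shortest-repr float formatting; the variable names are only illustrative, and slicing to 21 characters is a stand-in for what a U21 field can hold, not NumPy's actual code path.

```python
import sys

a = 4.4226657709978134e-05   # the value from the question
b = 4.4226657709978130e-05   # a neighbouring value that "works"

ra, rb = repr(a), repr(b)
print(ra, len(ra))   # 22 characters: one too many for U21
print(rb, len(rb))   # 21 characters: fits exactly

# Keeping only the first 21 characters mimics storing into a U21 field
clipped = ra[:21]
print(clipped, float(clipped))   # exponent becomes e-0, so the value is > 1

# Widest case: sign, 17 significant digits, point, 3-digit exponent
widest = repr(-sys.float_info.min)
print(widest, len(widest))       # 24 characters
```

The clipped string still parses as a valid float, which would explain why the failure shows up as occasional values greater than 1 instead of as an error.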

Iguananaut

If you want to mix strings and floats in an array you don't have to use pandas. Object dtype works (that's what pandas uses):

In [394]: a = 4.4226657709978134e-05
In [395]: np.array(['string',a])
Out[395]: array(['string', '4.4226657709978134e-05'], dtype='<U22')
In [396]: np.array(['string',a], object)
Out[396]: array(['string', 4.4226657709978134e-05], dtype=object)

Or structured dtype:

In [398]: np.array([('string',a)],'U10,float')
Out[398]: array([('string', 4.42266577e-05)], dtype=[('f0', '<U10'), ('f1', '<f8')])
In [399]: _.item()
Out[399]: ('string', 4.4226657709978134e-05)
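Either approach keeps the number as a number, so a range check like the one described in the question can run on the stored values directly. A small sketch, assuming NumPy is installed; the field names 'label' and 'value' are hypothetical and not from the examples above.

```python
import numpy as np

a = 4.4226657709978134e-05

# Object dtype stores the original Python objects, so nothing is converted
obj = np.array(['string', a], dtype=object)
print(type(obj[1]), obj[1] == a)

# Structured dtype: a text field plus a float64 field
# ('label' and 'value' are hypothetical field names)
rec = np.array([('string', a)], dtype=[('label', 'U10'), ('value', 'f8')])
print(rec['value'][0] == a)

# The sanity check from the question now operates on real floats
print((rec['value'] < 1).all())
```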
hpaulj