
I have the following array:

data=array([['beef', 'bread', 'cane_molasses', nan, nan, nan],
       ['brassica', 'butter', 'cardamom']])

How can I delete the nans to get:

array([['beef', 'bread', 'cane_molasses'],
       ['brassica', 'butter', 'cardamom']])

I have tried the method given here, but it does not work since in my case the array is of higher dimension and is not a simple vector.
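
For reference, what I tried is the isnan-masking approach, roughly along these lines (the exact call in the linked answer may differ); it fails because np.isnan is not defined for object-dtype elements:

import numpy as np

data = np.array([['beef', 'bread', 'cane_molasses', np.nan, np.nan, np.nan],
                 ['brassica', 'butter', 'cardamom']], dtype=object)

data[~np.isnan(data)]  # raises TypeError: ufunc 'isnan' not supported for the input types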

  • Your array is 1d, shape (2,). But it contains lists. You could apply the linked answer to each of those lists. For most purposes your array is a list - a list of lists. – hpaulj Nov 12 '18 at 17:07
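
A quick check illustrates hpaulj's point (a minimal sketch, assuming the question's data built with dtype=object):

import numpy as np

data = np.array([['beef', 'bread', 'cane_molasses', np.nan, np.nan, np.nan],
                 ['brassica', 'butter', 'cardamom']], dtype=object)

print(data.shape)     # (2,) -- a 1-d object array
print(type(data[0]))  # <class 'list'> -- each element is a plain Python list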

1 Answer


object dtype arrays do not support vectorised operations, but you can do a round trip: convert first to a list and then back to an array. Here we use the fact that np.nan != np.nan by design:

import numpy as np

# dtype=object is needed on newer NumPy versions because the rows have different lengths
data = np.array([['beef', 'bread', 'cane_molasses', np.nan, np.nan, np.nan],
                 ['brassica', 'butter', 'cardamom']], dtype=object)

# i == i is False only for NaN, so this drops the NaN entries from each row
res = np.array([[i for i in row if i == i] for row in data.tolist()])

array([['beef', 'bread', 'cane_molasses'],
       ['brassica', 'butter', 'cardamom']], 
      dtype='<U13')
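
As a quick sanity check of the NaN inequality this relies on:

import numpy as np

print(np.nan == np.nan)  # False -- NaN compares unequal to everything, including itself
print(np.nan != np.nan)  # True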

Note the resultant array has a string dtype (here '<U13', i.e. Unicode strings with a maximum length of 13). If you want an object dtype array, which can hold arbitrary objects, you need to specify dtype=object:

# dtype=object keeps the elements as arbitrary Python objects rather than fixed-width strings
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)

array([['beef', 'bread', 'cane_molasses'],
       ['brassica', 'butter', 'cardamom']], dtype=object)
  • This is an elegant solution but a very dangerous piece of code to include in any data processing pipeline as it will break silently if the not-a-number specification changes. – Paul Brodersen Nov 13 '18 at 22:26
  • @PaulBrodersen, `np.nan != np.nan` is fundamental to `NaN` as a concept, e.g. the docs for `np.isnan` have "NumPy uses the IEEE Standard for Binary Floating-Point for Arithmetic (IEEE 754)." The rationale is built into IEEE 754 ([see here](https://stackoverflow.com/a/1573715/9209546)). It may not be seemly, but neither is it the worst assumption. – jpp Nov 13 '18 at 22:39
  • Sorry, I was maybe too imprecise. I am not worried about the not-a-number standard changing in numpy, I am worried about OP changing the way he or she imports data such that for example `nan` becomes `'nan'`, etc. – Paul Brodersen Nov 13 '18 at 22:41
  • @PaulBrodersen, That's a fair point, thanks for raising it. My solution does indeed assume the user can rely on null values being `np.nan`. – jpp Nov 13 '18 at 22:43
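
Following up on Paul Brodersen's point, a more defensive variant is sketched below (the handling of None and of the literal string 'nan' is an assumption about how the data might arrive, not part of the answer above):

import numpy as np

def is_missing(value):
    """Treat NaN, None and the literal string 'nan' as missing (hypothetical policy)."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.lower() == 'nan'
    return value != value  # True only for NaN

data = np.array([['beef', 'bread', 'cane_molasses', np.nan, 'nan', None],
                 ['brassica', 'butter', 'cardamom']], dtype=object)

res = np.array([[i for i in row if not is_missing(i)] for row in data.tolist()])
print(res)
# [['beef' 'bread' 'cane_molasses']
#  ['brassica' 'butter' 'cardamom']]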