I have a numpy array like

np.array([[1.0, np.nan, 5.0, 1, True, True, np.nan, True],
       [np.nan, 4.0, 7.0, 2, True, np.nan, False, True],
       [2.0, 5.0, np.nan, 3, False, False, True, np.nan]], dtype=object)

Now I want to sort each row using isnan as the key, so that the NaNs move to the end of the row. How can I do that? I would like to end up with the array

np.array([[1.0, 5.0, 1, True, True, True, np.nan, np.nan],
   [4.0, 7.0, 2, True, False, True, np.nan, np.nan],
   [2.0, 5.0, 3, False, False, True, np.nan, np.nan]], dtype=object)

np.sort() didn't work. The same can be achieved in pandas by applying sorted with pd.isnull as the key across each row, but I am looking for a NumPy answer for speed.

In pandas

data = pd.DataFrame({'Key': [1, 2, 3], 'Var': [True, True, False], 'ID_1':[1, np.nan, 2],
                'Var_1': [True, np.nan, False], 'ID_2': [np.nan, 4, 5], 'Var_2': [np.nan, False, True],
                'ID_3': [5, 7, np.nan], 'Var_3': [True, True, np.nan]})

data.apply(lambda x : sorted(x,key=pd.isnull),1).values 

Output :

array([[1.0, 5.0, 1, True, True, True, nan, nan],
   [4.0, 7.0, 2, True, False, True, nan, nan],
   [2.0, 5.0, 3, False, False, True, nan, nan]], dtype=object)
Divakar
Bharath M Shetty

3 Answers

Approach #1

Here's a vectorized approach borrowing the concept of masking from this post -

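import numpy as np

def mask_app(a):
    out = np.empty_like(a)
    # True where a value is NaN (cast needed because the dtype is object)
    mask = np.isnan(a.astype(float))
    # Sorting each row's flags moves the True (NaN) slots to the right
    mask_sorted = np.sort(mask,1)
    # Fill NaNs into the right-hand slots, everything else into the rest
    out[mask_sorted] = a[mask]
    out[~mask_sorted] = a[~mask]
    return out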

Sample run -

# Input dataframe
In [114]: data
Out[114]: 
   ID_1  ID_2  ID_3  Key    Var  Var_1  Var_2 Var_3
0   1.0   NaN   5.0    1   True   True    NaN  True
1   NaN   4.0   7.0    2   True    NaN  False  True
2   2.0   5.0   NaN    3  False  False   True   NaN

# Use pandas approach for verification    
In [115]: data.apply(lambda x : sorted(x,key=pd.isnull),1).values
Out[115]: 
array([[1.0, 5.0, 1, True, True, True, nan, nan],
       [4.0, 7.0, 2, True, False, True, nan, nan],
       [2.0, 5.0, 3, False, False, True, nan, nan]], dtype=object)

# Use proposed approach and verify
In [116]: mask_app(data.values)
Out[116]: 
array([[1.0, 5.0, 1, True, True, True, nan, nan],
       [4.0, 7.0, 2, True, False, True, nan, nan],
       [2.0, 5.0, 3, False, False, True, nan, nan]], dtype=object)
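
To make the masking idea in Approach #1 easier to follow, here is a minimal step-by-step sketch on a tiny array. The array `a` below is made up for illustration and is not part of the original sample run.

```python
import numpy as np

# Small illustrative input (hypothetical, not the data from the question).
a = np.array([[1.0, np.nan, 5.0],
              [np.nan, 4.0, np.nan]], dtype=object)

# True where the element is NaN; the object array is cast to float first.
mask = np.isnan(a.astype(float))

# Sorting the boolean flags along each row pushes the True flags to the
# right, marking the slots where the NaNs should end up.
mask_sorted = np.sort(mask, axis=1)

out = np.empty_like(a)
# Boolean indexing reads and writes in row-major order, and each row has the
# same number of True flags in both masks, so values stay in their own row
# and keep their original relative order.
out[~mask_sorted] = a[~mask]
out[mask_sorted] = a[mask]

print(mask_sorted)
print(out)
```

The key property is that `mask` and `mask_sorted` contain the same number of True values per row, which is what lets the two flat assignments line up row by row.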

Approach #2

With a few more modifications, here is a simplified version based on the idea from this post -

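def mask_app2(a):
    # Start from an all-NaN array, so only the valid values need placing
    out = np.full(a.shape,np.nan,dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    # Descending sort per row puts the True (valid) slots on the left
    out[np.sort(mask,1)[:,::-1]] = a[mask]
    return out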
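
As a rough sanity check, the sketch below compares Approach #2 against a plain Python reference on the array from the question. The reference is a stable per-row `sorted` whose key only flags NaNs. The helpers `is_nan` and `same` are hypothetical names introduced here for the comparison, not part of the answer.

```python
import math
import numpy as np

def mask_app2(a):
    out = np.full(a.shape, np.nan, dtype=a.dtype)
    mask = ~np.isnan(a.astype(float))
    out[np.sort(mask, 1)[:, ::-1]] = a[mask]
    return out

a = np.array([[1.0, np.nan, 5.0, 1, True, True, np.nan, True],
              [np.nan, 4.0, 7.0, 2, True, np.nan, False, True],
              [2.0, 5.0, np.nan, 3, False, False, True, np.nan]], dtype=object)

def is_nan(x):
    # Only floats can be NaN; bools and ints are never flagged.
    return isinstance(x, float) and math.isnan(x)

# Reference: a stable sort that moves NaNs to the end and nothing else.
expected = [sorted(row, key=is_nan) for row in a.tolist()]
result = mask_app2(a).tolist()

def same(x, y):
    # NaN never equals itself, so compare NaN-ness separately; also require
    # identical types so that True is not confused with 1.
    return (is_nan(x) and is_nan(y)) or (type(x) is type(y) and x == y)

matches = all(same(x, y)
              for r, e in zip(result, expected)
              for x, y in zip(r, e))
print(matches)
```

Because the array has dtype object, the original Python objects are moved rather than converted, so bools stay bools and ints stay ints.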
Divakar
  • I was waiting for you. :) – Bharath M Shetty Sep 20 '17 at 16:04
  • I would love to give a bounty for this solution. This is beautiful. – Bharath M Shetty Sep 20 '17 at 16:08
  • Sir, a small question: how long have you been working on vectorizing solutions? – Bharath M Shetty Sep 20 '17 at 16:21
  • @Bharathshetty Vectorizing this particular solution, you mean? – Divakar Sep 20 '17 at 16:22
  • Sir, no no no, I mean your experience with numpy and vectorization. You can vectorize any kind of for loop :) – Bharath M Shetty Sep 20 '17 at 16:23
  • @Bharathshetty It's been a while. I started off with MATLAB and loved vectorizing stuff in it. I heard about NumPy and jumped on it; it has its own unique and interesting capabilities, and I have been hooked on it ever since. I get to answer MATLAB questions sometimes too these days. But yeah, I generally try to think about how to avoid loops, and that helps, I think :) – Divakar Sep 20 '17 at 16:25
  • I should try to vectorize as many for loops as I can. I don't like loops either, though vectorizing is very tricky and my brain is not ready for it yet. :) :) – Bharath M Shetty Sep 20 '17 at 16:27

Since you have an object array anyway, do the sorting in Python, then make your array. You can write a key that does something like this:

from math import isnan

def key(x):
    if isnan(x):
        t = 3
        x = 0
    elif isinstance(x, bool):
        t = 2
    else:
        t = 1
    return t, x

This key returns a two-element tuple, where the first element gives the preliminary ordering by type. It considers all NaNs to be equal and greater than any other type.
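
A small sketch of what this key does to the first row from the question may help. The `key` function is repeated so the snippet runs on its own, and the row is written with `float("nan")` purely for illustration.

```python
from math import isnan

def key(x):
    if isnan(x):
        t = 3
        x = 0
    elif isinstance(x, bool):
        t = 2
    else:
        t = 1
    return t, x

row = [1.0, float("nan"), 5.0, 1, True, True, float("nan"), True]

# Each value maps to (type rank, value); NaNs all collapse to (3, 0).
keys = [key(v) for v in row]
ordered = sorted(row, key=key)
print(keys)
print(ordered)
```

Note that the second tuple element also orders the values inside each group, so the numbers come out as 1.0, 1, 5.0 rather than in their original order 1.0, 5.0, 1 as in the pandas output. If only the NaNs should move, a key that returns just the NaN flag would likely be enough, since `sorted` is stable.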

Even if you start with data in a DataFrame, you can do something like:

values = [sorted(row, key=key) for row in data.values]
values = np.array(values, dtype=object)

You can replace the list comprehension with np.apply_along_axis if that suits your needs better:

values = np.apply_along_axis(lambda row: np.array(sorted(row, key=key), dtype=object),
                             axis=1, arr=data.values)
Mad Physicist
  • Can this be done with something like `apply_along_axis`? – Bharath M Shetty Sep 20 '17 at 15:55
  • @Bharathshetty. You can replace the list comprehension with `apply_along_axis`. I will show an example, but I doubt it will speed things up any. You will still be using the Python `sorted` function and a Python key. – Mad Physicist Sep 20 '17 at 15:59
  • The problem is that I am not aware of any way to specify a custom key to the numpy machinery. There may be one, but I have looked at this in *a lot* of detail. – Mad Physicist Sep 20 '17 at 16:02

You can't do this with an object array and nan. You would need to find a numeric type everything would fit into. When used as an object instead of as a float, nan returns False for <, >, and ==.

Additionally, True and False are equivalent to 1 and 0, so I don't think there is any way to get your expected result.

You would have to see if converting the dtype to float would give you proper results for your use case.
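
The comparison behaviour described above can be checked with a short sketch. The variable names and the small array are made up for illustration, and the NaN comparisons shown use a plain Python float.

```python
import numpy as np

nan = float("nan")

# Every ordering or equality comparison involving NaN is False.
comparisons = (nan < 1.0, nan > 1.0, nan == nan)

# bool is a subclass of int, so True equals 1 and False equals 0.
bool_equalities = (True == 1, False == 0)

# A float copy makes NaN detection possible, but the bool/int/float
# distinction of the original objects is lost in that copy.
a = np.array([1.0, np.nan, True, 3], dtype=object)
as_float = a.astype(float)

print(comparisons, bool_equalities)
print(as_float)
```

This is why a float conversion is usable for locating the NaNs but would not, on its own, preserve values such as True in the output.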

Edward Minnix