Is there a better way to convert 'object' type array to numpy array by replacing 'na' with mean?

Question

I have an object array of strings and floats in which some elements, such as 'na', can't be converted to float with x.astype(np.float) as given here.

Is there a better way than the one I used? My procedure is below (a snippet from my Jupyter notebook; I show the intermediate steps only to demonstrate the changes):

In [4]: val_inc

Out [4]:

array(['na', '38.012', '38.7816', '38.0736', '40.7118', '44.7382',
       '39.6416', '38.9177', '36.9031', 43.2611, '38.2732', 40.7129,
       '37.2844', '39.5835', 43.9194, '42.5485', '36.9052', 'na', 41.9264,
       45.3568, '44.6239', 38.1079, 45.2393, '32.785', '44.6239',
       '38.0216', '38.4608', '42.5644', '35.3127', 33.2936, '33.0556',
       '40.4476', 35.6581, '35.5574', '43.1096', '34.4751', 42.0554,
       40.3944, '40.2466', '32.2567', 'na', '38.8594', '43.947', 41.7973,
       '41.8105', 40.3797, 31.2868, '45.3644', '40.7177', '41.8558',
       '38.9249', '33.2077', '42.4053', '42.559'], dtype=object)

In [5]: val_inc[val_inc == 'na']='0'

In [6]: val_inc

Out [6]:

array(['0', '38.012', '38.7816', '38.0736', '40.7118', '44.7382',
       '39.6416', '38.9177', '36.9031', 43.2611, '38.2732', 40.7129,
       '37.2844', '39.5835', 43.9194, '42.5485', '36.9052', '0', 41.9264,
       45.3568, '44.6239', 38.1079, 45.2393, '32.785', '44.6239',
       '38.0216', '38.4608', '42.5644', '35.3127', 33.2936, '33.0556',
       '40.4476', 35.6581, '35.5574', '43.1096', '34.4751', 42.0554,
       40.3944, '40.2466', '32.2567', '0', '38.8594', '43.947', 41.7973,
       '41.8105', 40.3797, 31.2868, '45.3644', '40.7177', '41.8558',
       '38.9249', '33.2077', '42.4053', '42.559'], dtype=object)

In [7]: val_inc = val_inc.astype(np.float)

In [8]: val_inc

Out [8]:

array([  0.    ,  38.012 ,  38.7816,  38.0736,  40.7118,  44.7382,
        39.6416,  38.9177,  36.9031,  43.2611,  38.2732,  40.7129,
        37.2844,  39.5835,  43.9194,  42.5485,  36.9052,   0.    ,
        41.9264,  45.3568,  44.6239,  38.1079,  45.2393,  32.785 ,
        44.6239,  38.0216,  38.4608,  42.5644,  35.3127,  33.2936,
        33.0556,  40.4476,  35.6581,  35.5574,  43.1096,  34.4751,
        42.0554,  40.3944,  40.2466,  32.2567,   0.    ,  38.8594,
        43.947 ,  41.7973,  41.8105,  40.3797,  31.2868,  45.3644,
        40.7177,  41.8558,  38.9249,  33.2077,  42.4053,  42.559 ])

In [9]: np.mean(val_inc[val_inc!=0.])

Out [9]: 39.587374509803915

In [10]: val_inc[val_inc==0.]=np.mean(val_inc[val_inc!=0.])

In [11]: val_inc

Out [11]:

array([ 39.58737451,  38.012     ,  38.7816    ,  38.0736    ,
        40.7118    ,  44.7382    ,  39.6416    ,  38.9177    ,
        36.9031    ,  43.2611    ,  38.2732    ,  40.7129    ,
        37.2844    ,  39.5835    ,  43.9194    ,  42.5485    ,
        36.9052    ,  39.58737451,  41.9264    ,  45.3568    ,
        44.6239    ,  38.1079    ,  45.2393    ,  32.785     ,
        44.6239    ,  38.0216    ,  38.4608    ,  42.5644    ,
        35.3127    ,  33.2936    ,  33.0556    ,  40.4476    ,
        35.6581    ,  35.5574    ,  43.1096    ,  34.4751    ,
        42.0554    ,  40.3944    ,  40.2466    ,  32.2567    ,
        39.58737451,  38.8594    ,  43.947     ,  41.7973    ,
        41.8105    ,  40.3797    ,  31.2868    ,  45.3644    ,
        40.7177    ,  41.8558    ,  38.9249    ,  33.2077    ,
        42.4053    ,  42.559     ])

Replace `'na'` with `'nan'` and it will be convertible to floating point. — MB-F, Jan 09 '18 at 08:11
@kazemakase thanks for your suggestion. I was not aware that string 'nan' could have been directly converted to np.nan — thepunitsingh, Jan 10 '18 at 18:16
Apologies that my question turned out to be a duplicate, I will work on my searching skills. — thepunitsingh, Jan 10 '18 at 18:24
no need to apologize... on the contrary, being marked as a duplicate your question now serves as a sign-post for others who may be looking for the same search terms as you did. — MB-F, Jan 10 '18 at 18:34

Julien · Accepted Answer · 2018-01-09T08:41:26.917

3

Replace 'na' with 'nan' so that it converts to np.nan, then use np.nanmean to compute the mean while ignoring the missing values.

Example:

import numpy as np
test = np.array(['0','1','nan'], dtype=float)     # the string 'nan' parses to np.nan
np.where(np.isnan(test), np.nanmean(test), test)  # fill each NaN with the mean of the rest

array([ 0. ,  1. ,  0.5])
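Applied to a mixed object array like the one in the question, the same idea could look like the sketch below. The helper name `fill_na_with_mean`, its `marker` parameter, and the short sample array are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np

def fill_na_with_mean(arr, marker='na'):
    """Return a float array where `marker` entries are replaced by the mean
    of the remaining values. Hypothetical helper, for illustration only."""
    cleaned = np.array(arr, dtype=object)   # copy, so the caller's array is untouched
    cleaned[cleaned == marker] = 'nan'      # 'nan' is a string that float() can parse
    values = cleaned.astype(float)          # strings and floats both convert now
    # np.nanmean ignores NaN entries; np.where substitutes it only where NaN occurs
    return np.where(np.isnan(values), np.nanmean(values), values)

# Shortened, made-up sample in the style of the question's data
sample = np.array(['na', '38.012', 43.2611, '36.9052', 'na', 41.9264], dtype=object)
filled = fill_na_with_mean(sample)
print(filled)
```

Working on a copy keeps the original strings available in case the marker needs to be inspected again later.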

edited Jan 09 '18 at 08:41

answered Jan 09 '18 at 08:11

Julien


Your suggestion was the fastest way to solve my problem among other suggestions. Thanks! – thepunitsingh Jan 10 '18 at 18:22

score 2 · Answer 2 · answered Jan 09 '18 at 08:36

It would be better to first convert 'na' to a proper NaN. The data can then be used any way one wants:

import numpy as np
val_inc[val_inc == 'na'] = np.nan   # 'na' to proper NaN or missing value
val_inc = val_inc.astype(float)     # no error here now (the np.float alias was removed in NumPy 1.24)
print(val_inc)

Output:

[     nan  38.012   38.7816  38.0736  40.7118  44.7382  39.6416  38.9177
  36.9031  43.2611  38.2732  40.7129  37.2844  39.5835  43.9194  42.5485
  36.9052      nan  41.9264  45.3568  44.6239  38.1079  45.2393  32.785
  44.6239  38.0216  38.4608  42.5644  35.3127  33.2936  33.0556  40.4476
  35.6581  35.5574  43.1096  34.4751  42.0554  40.3944  40.2466  32.2567
      nan  38.8594  43.947   41.7973  41.8105  40.3797  31.2868  45.3644
  40.7177  41.8558  38.9249  33.2077  42.4053  42.559 ]
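To finish the task from the question, the NaN entries can then be overwritten with the mean of the valid entries. The following is a minimal sketch on a small made-up array (the sample values are assumptions, chosen to include a genuine zero). Unlike a `0` placeholder, the NaN mask leaves a real zero in the data untouched.

```python
import numpy as np

# Made-up sample: one missing value and one genuine zero
val_inc = np.array(['na', '0', 2.0, '4.0'], dtype=object)

val_inc[val_inc == 'na'] = np.nan   # mark missing entries as NaN
val_inc = val_inc.astype(float)     # convert the whole array to float

missing = np.isnan(val_inc)                  # boolean mask of the missing entries
val_inc[missing] = val_inc[~missing].mean()  # fill in place with the mean of the rest
print(val_inc)
```

Here the mean of the valid values 0, 2 and 4 is 2, so only the first element changes and the genuine zero is preserved.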
