
I have a pandas DataFrame that I converted to a NumPy ndarray. I call np.max on one of its columns like this:

print('column: ',df[:,3])
print('max: ',np.max(df[:,3]))

And the output was:

column: [0.6559999999999999 0.48200000000000004 0.9990000000000001 ..., 1.64 nan 0.07]
max: 0.07

But as you can see, the first value, for example, is greater than 0.07! What is the problem?

2 Answers

There are two problems here:



  1. It looks like the column you are taking the maximum of has the object data type. That is not recommended when the column should contain numerical data, because it can cause unpredictable behaviour, and not only in this particular case. Check the data types of your DataFrame (you can do this with df.dtypes) and change them to match the data you expect (in this case, df[column_name].astype(np.float64)). This is also the reason np.nanmax does not work properly.

  2. You don't want to use np.max on arrays that contain NaNs.
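To see how an object column can end up reporting 0.07, here is a minimal pure-Python sketch. It assumes the maximum over object values is built from pairwise `>=` comparisons; `pairwise_max` is a hypothetical helper written for illustration, not NumPy's actual implementation. Every ordering comparison with NaN is False, so once NaN is reached the running maximum is effectively reset.

```python
values = [0.7, 0.5, 1.0, 1.64, float("nan"), 0.07]


def pairwise_max(items):
    """Hypothetical helper: keep the current value only if `current >= item`."""
    current = items[0]
    for item in items[1:]:
        # Any comparison involving NaN is False, so the `else` branch wins
        # both when `item` is NaN and when `current` is NaN.
        current = current if current >= item else item
    return current


result = pairwise_max(values)
print(1.64 >= float("nan"))  # False, so NaN replaces 1.64
print(float("nan") >= 0.07)  # False, so 0.07 replaces NaN
print(result)                # 0.07
```

Under this assumption, the value returned is simply whatever survives after the last NaN, which matches the 0.07 seen in the question.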



Solution



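  1. If you are sure the column should keep the object data type:

    1.1. You can use the max method of Series, which should cast the data to float automatically. Note that iloc[:, 3] selects the column at position 3, whereas iloc[3] would select a row.

    df.iloc[:, 3].max()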

    1.2. You can cast the data to the proper type just for the nanmax call.

    np.nanmax(df.values[:,3].astype(np.float64))

    1.3. You can drop all NaNs from the column and then find the max [not recommended]:

    np.max(df[column_name].dropna().values)
    

  2. If your data is really float64 and should not have the object data type [recommended]:

    df[column_name] = df[column_name].astype(np.float64)
    
    np.nanmax(df.values[:,3])
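
The recommended route can be sketched end to end as follows. This is only an illustration under assumptions: the column name `measurement` is made up, and `pd.to_numeric` with `errors="coerce"` is swapped in for `astype(np.float64)` as one possible way to do the conversion (it turns values that cannot be parsed into NaN instead of raising).

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose numeric column was stored with the object dtype.
df = pd.DataFrame(
    {"measurement": np.array([0.7, 0.5, 1.0, 1.64, np.nan, 0.07]).astype(object)}
)
print(df.dtypes)  # the column is reported as object

# Convert to a real float column; unparseable entries would become NaN.
df["measurement"] = pd.to_numeric(df["measurement"], errors="coerce")
print(df.dtypes)  # the column is now float64

# With a float64 column, nanmax ignores the NaN entry.
col_max = np.nanmax(df["measurement"].to_numpy())
print(col_max)  # 1.64
```

If the column is known to contain only numbers and NaNs, `astype(np.float64)` as shown above does the same job and fails loudly on unexpected values.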
    


Code to illustrate the problem



#python
import pandas as pd
import numpy as np 

test_data = pd.DataFrame({
                   'objects_column': np.array([0.7,0.5,1.0,1.64,np.nan,0.07]).astype(object),
                   'floats_column': np.array([0.7,0.5,1.0,1.64,np.nan,0.07]).astype(np.float64)})

print("********Using np.max function********")
print("Max of objects array:", np.max(test_data['objects_column'].values))
print("Max of floats array:", np.max(test_data['floats_column'].values))

print("\n********Using max method of series function********")
print("Max of objects array:", test_data["objects_column"].max()) 
print("Max of floats array:", test_data["floats_column"].max())

Returns:

********Using np.max function********
Max of objects array: 0.07
Max of floats array: nan

********Using max method of series function********
Max of objects array: 1.64
Max of floats array: 1.64
Stas Buzuluk

np.max is an alias for the function np.amax, which, according to the documentation, doesn't play well with NaN values: it propagates them. To ignore NaN values, use np.nanmax instead.
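
A short sketch of the difference on a float64 array, reusing the values from the question (the variable names are just for illustration):

```python
import numpy as np

arr = np.array([0.7, 0.5, 1.0, 1.64, np.nan, 0.07], dtype=np.float64)

plain_max = np.max(arr)         # NaN is propagated
nan_aware_max = np.nanmax(arr)  # NaN is ignored

print(plain_max)      # nan
print(nan_aware_max)  # 1.64
```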

jovany merham
  • That's a good assumption, but not a correct answer. It looks like the real problem was an improper data type. As the numpy.amax documentation specifies, if the array contains a NaN, amax returns NaN, which is not what happened here. https://numpy.org/doc/stable/reference/generated/numpy.amax.html – Stas Buzuluk Aug 31 '20 at 13:54
  • There's a discussion that extends the question a little: https://chat.stackoverflow.com/rooms/220618/discussion-between-mohammad-sadra-sharifzadeh-and-stas-buzuluk – Stas Buzuluk Aug 31 '20 at 14:17