1

I have a dataframe created from collected sampled data. I then manipulate the dataframe to remove duplicates, sort, and remove saturated values:

df = pd.read_csv(path+ newfilename, header=0, usecols=[0,1,2,3,5,7,10],
                names=['ch1_real', 'ch1_imag', 'ch2_real', 'ch2_imag', 'ch1_log_mag', 'ch1_phase',
                      'ch2_log_mag', 'ch2_phase', 'pr_sample_real', 'pr_sample_imag', 'distance'])    
tmp=df.drop_duplicates(subset='distance', keep='first').copy()
tmp.sort_values("distance", inplace=True)
dfUnique=tmp[tmp.distance <65000].copy()

I also add two calculated values (with help from @Stef) dfUnique['ch1_log_mag']=20np.log10((dfUnique.ch1_real +1jdfUnique.ch1_imag).abs()) dfUnique['ch2_log_mag']=20np.log10((dfUnique.ch2_real +1jdfUnique.ch2_imag).abs())

the problem arises when I try to find the index of the maximum magnitude. It turns out (unexpectedly to me), that dataframes keep there original data indices. So, after sorting and removing rows, the index of given row is not its index in the new ordered dataframe, but its row index within the original dataframe:

         ch1_real  ch1_imag  ch2_real  ...  distance  ch1_log_mag  ch2_log_mag
79   0.011960 -0.003418  0.005127  ...       0.0   -38.104414   -33.896518
78  -0.009766 -0.005371 -0.015870  ...       1.0   -39.058001   -34.533870
343  0.002197  0.010990  0.003662  ...       2.0   -39.009865   -37.278737
80  -0.002686  0.010740  0.011960  ...       3.0   -39.116435   -34.902513
341 -0.007080  0.009033  0.016600  ...       4.0   -38.803434   -35.582833
81  -0.004883 -0.008545 -0.016850  ...      12.0   -40.138523   -35.410047
83  -0.009277  0.004883 -0.000977  ...      14.0   -39.589769   -34.848170
84   0.006592 -0.010250 -0.009521  ...      27.0   -38.282239   -33.891250
85   0.004395  0.010010  0.017580  ...      41.0   -39.225735   -34.890353
86  -0.007812 -0.005127 -0.015380  ...      53.0   -40.589187   -35.625615

When I then use:

np.argmax(dfUnique.ch1_log_mag)

to find the index of maximum magnitude, this returns the index in the new ordered dataframe series. But, when I use this to index into the dataframe to extract other values in that row, I get elements from the original dataframe at that row index.

I exported the dataframe to excel to more easily observe what was happening. Column 1 is the dataframe index. Notice that is is different than the row number on the spreadsheet.

enter image description here

The np.argmax command above returns 161. If I look at the new ordered dataframe, index 161 is this row highlighted below (data starts on row two in the spreadsheet, and indices start at 0 in python): enter image description here and is correct. However, per the original dataframes order, this was at index 238. When I then try to access ch1_log_max[161],

dfUnique.ch1_log_mag[161]

I get -30.9759, instead of -11.453. It grabbed the value using 161 as the index into original dataframe:

enter image description here

this is pretty scary --that two functions use two different reference frames (at least to a novice python user). How do I avoid this? (How) Do I reindex the dataframe? or should I be using an equivalent pandas way of finding the maximum in a series within a dataframe (assuming the issue is due to how pandas and numpy operate on data)? Is the issue the way I'm creating copies of the dataframe?

jrive
  • 218
  • 2
  • 3
  • 14
  • `np.argmax` is a `numpy` function. If given a `pandas` series it's going work with the `values` array from that series. It's a relative position. It does not pay attention to the `pandas` defined Series index. Look for something equivalent in `pandas` that does use the index. – hpaulj Jan 15 '21 at 17:54

1 Answers1

1

If you sort a dataframe, it preserves indices.

import pandas as pd
a = pd.DataFrame(np.random.randn(24).reshape(6,4), columns=list('abcd'))
a.sort_values(by='d', inplace=True)
print(a)
>>>
          a         b         c         d
2 -0.553612  1.407712 -0.454262 -1.822359
0 -1.046893  0.656053  1.036462 -0.994408
5 -0.772923 -0.554434 -0.254187 -0.948573
4 -1.660773  0.291029  1.785757 -0.457495
3  0.128831  1.399746  0.083545 -0.101106
1 -0.250536 -0.045355  0.072153  1.871799

In order to reset index, you can use .reset_index(drop=True):

b = a.sort_values(by='d').reset_index(drop=True)
print(b)
>>>
          a         b         c         d
0 -0.553612  1.407712 -0.454262 -1.822359
1 -1.046893  0.656053  1.036462 -0.994408
2 -0.772923 -0.554434 -0.254187 -0.948573
3 -1.660773  0.291029  1.785757 -0.457495
4  0.128831  1.399746  0.083545 -0.101106
5 -0.250536 -0.045355  0.072153  1.871799

To find the original index of max value, you can use .idxmax() then use .loc[]:

ix_max = a.d.idxmax()
# or ix_max = np.argmax(a.d)
print(f"ix_max = {ix_max}")
a.loc[ix_max]
>>>
ix_max = 1
a   -0.250536
b   -0.045355
c    0.072153
d    1.871799
Name: 1, dtype: float64

or if you have got new index order, you can use .iloc:

iix = np.argmax(a.d.values)
print(f"iix = {iix}")
print(a.iloc[iix])
>>>
iix = 5
a   -0.250536
b   -0.045355
c    0.072153
d    1.871799
Name: 1, dtype: float64

You can have a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

armamut
  • 1,087
  • 6
  • 14