2

I have some problems and hope that you can help me. The data goes as follows:

Data

The column are labeled as height and the numbers are the available data for that timeframe. So for example for the first row, at the height of 288m only 10% of the data is available (the other 90% was not able to be measured)

I want to set a threshhold value, for example 80. So only heights which contain at least 80% of available data should be included. For each row I want to find the last value for which >= 80 is still true. The height in which this value was found should then be added to a new column.

I can show the rows with: df.iloc[0], this would get me all the values of the first row... normally I would write a loop but since this is Pandas DataFrame there should be a more convenient option for getting the last value for which the condition >= 80 is true.

The Pseudocode would look like this:

  • for every row in the data frame check if value is >= 80
  • extract the column name for the last value this condition true
  • append this value to a new column called 'height value'

I hope I made myself somewhat clear. I am quite new to pandas and seem to struggle quite a bit with data extraction based on conditional expression.

EDIT:

The pandas code for the first 3 rows is: (time index is not important so I didn't include it)

pdf = pd.DataFrame([[100.0, 100.0, 100.0, 100.0, 100.0, 99.0, 97.0, 93.0, 71.0, 45.0, 27.0, 10.0],
                    [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 94.0, 81.0, 62.0, 36.0, 9.0],
                    [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 92.0, 90.0, 68.0, 45.0, 25.0, 13.0]],
                    columns=[40.0, 58.0, 78.0, 98.0, 118.0, 138.0, 163.0, 178.0, 198.0, 228.0, 248.0, 288.0])

EDIT 2: The output data should look like this:

result = pd.DataFrame([[100.0, 100.0, 100.0, 100.0, 100.0, 99.0, 97.0, 93.0, 71.0, 45.0, 27.0, 10.0, 178.0],
                    [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 94.0, 81.0, 62.0, 36.0, 9.0, 198.0],
                    [100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 92.0, 90.0, 68.0, 45.0, 25.0, 13.0, 178.0]],
                    columns=[40.0, 58.0, 78.0, 98.0, 118.0, 138.0, 163.0, 178.0, 198.0, 228.0, 248.0, 288.0,'max_available_height'])
Medhusalem
  • 65
  • 2
  • 10
  • are you able to use a textual sample of your dataframe and paste that instead, also please give your expected output. 5-10 rows of each is fine. – Umar.H Nov 18 '19 at 12:36
  • 1
    have a read of this [How to ask a good pandas question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) then edit your question in-line. – Umar.H Nov 18 '19 at 12:39
  • Sorry for the inconvinience, I edited my question and put the first 3 rows in. – Medhusalem Nov 18 '19 at 12:49

1 Answers1

3

If match any value test it with DataFrame.any and numpy.where and DataFrame.idxmax with DataFrame.iloc for swap ordering for last matched value else return default value - e.g. not match:

thresh = 80
mask = pdf >= thresh
pdf['height value'] = np.where(mask.any(axis=1), mask.iloc[:,::-1].idxmax(axis=1), 'no match')
print (pdf)
    40.0   58.0   78.0   98.0  118.0  138.0  163.0  178.0  198.0  228.0  \
0  100.0  100.0  100.0  100.0  100.0   99.0   97.0   93.0   71.0   45.0   
1  100.0  100.0  100.0  100.0  100.0  100.0  100.0   94.0   81.0   62.0   
2  100.0  100.0  100.0  100.0  100.0  100.0   92.0   90.0   68.0   45.0   

   248.0  288.0 height value  
0   27.0   10.0        178.0  
1   36.0    9.0        198.0  
2   25.0   13.0        178.0 

Details:

 print (mask)
   40.0   58.0   78.0   98.0   118.0  138.0  163.0  178.0  198.0  228.0  \
0   True   True   True   True   True   True   True   True  False  False   
1   True   True   True   True   True   True   True   True   True  False   
2   True   True   True   True   True   True   True   True  False  False   

   248.0  288.0  
0  False  False  
1  False  False  
2  False  False  

print (mask.iloc[:,::-1])
   288.0  248.0  228.0  198.0  178.0  163.0  138.0  118.0  98.0   78.0   \
0  False  False  False  False   True   True   True   True   True   True   
1  False  False  False   True   True   True   True   True   True   True   
2  False  False  False  False   True   True   True   True   True   True   

   58.0   40.0   
0   True   True  
1   True   True  
2   True   True  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • In this example though, the values for height values I would like to have are: 0 - 178.0 1 - 198.0 2 - 178.0 Since those are the last values where the threshhold of 80 is true. – Medhusalem Nov 18 '19 at 12:53
  • @Medhusalem - Can you change data sample for create [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve)? – jezrael Nov 18 '19 at 12:57
  • Sorry, I don't quite understand what you mean. The data sample contains 10 columns. Each column is for a different height. The values in the rows are the availablity of the Data. (The data I have is from 1Hz measurements which then got put into a 10 minute average) So the value in each row just says how many values of the 10min intervals were valid. (I don't have access to the 1Hz data). – Medhusalem Nov 18 '19 at 13:05
  • @Medhusalem - Ok, so what is expected output from sample data? – jezrael Nov 18 '19 at 13:07
  • Like I wrotein the first reply: what I want is the height (the column name) for which the last >= 80 is true. For the first row it would be 178.0, for the second row it would be 198.0 and for the third row it would be 178.0. My aim is to find out for each row which was the maximal height where the measurements still meet the threshhold (in this case 80). – Medhusalem Nov 18 '19 at 13:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202573/discussion-between-medhusalem-and-jezrael). – Medhusalem Nov 18 '19 at 13:24