0

I hope you're all doing well.

So I've been working with Excel my whole life and I'm now switching to Python & Pandas. The Learning curve is proving to be quite steep for me, so please bare with me.

Day after day it's getting better. I've already managed to aggregate values, input/ouput from csv/excel, drop "na" values and much more. However, I've stumbeled upon a wall to high for me to climb right now...

I created an extract of the dataframe I'm working with. You can download it here, so you can understand what I'll be writing about: https://filetransfer.io/data-package/pWE9L29S#link

df_example

t_stamp,1_wind,2_wind,3_wind,4_wind,5_wind,6_wind,7_wind,1_wind_Q,2_wind_Q,3_wind_Q,4_wind_Q,5_wind_Q,6_wind_Q,7_wind_Q
2021-06-06 18:20:00,12.14397093693768,12.14570426940918,10.97993184016605,11.16468568605988,9.961717914791588,10.34653735907099,11.6856901451427,True,False,True,True,True,True,True
2021-05-10 19:00:00,8.045154709031468,8.572511270557484,8.499070711427668,7.949358210396142,8.252115912454919,7.116505042782365,8.815732567915179,True,True,True,True,True,True,True
2021-05-27 22:20:00,8.38946901817802,6.713454777683985,7.269814675171176,7.141862659613969,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-05 18:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-06-06 12:20:00,11.95525872119988,12.14570426940918,12.26086164116684,12.89527716859738,11.77172234144684,12.12409015586662,12.52180822809299,True,False,True,True,True,True,True
2021-06-04 03:30:00,14.72553364088618,12.72900662616056,10.59386275508178,10.96070182287055,12.38239256540934,12.07846616943932,10.58384464064597,True,True,True,True,False,True,True
2021-05-05 13:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-05-24 18:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.60747853230812,17.18577813727543,17.70745523935796,False,False,False,False,True,True,True
2021-05-07 19:00:00,13.94341927008482,10.95456999345216,13.36533234604886,0.0,3.782910539990379,10.86996953698871,13.45072022532649,True,True,True,False,False,True,True
2021-05-13 00:40:00,10.70940582779898,10.22222264510213,9.043496015164536,9.03805802580422,11.53775481234347,10.09538681656049,10.19345618536208,True,True,True,True,True,True,True
2021-05-27 19:40:00,10.8317678500958,7.929683248532885,8.264301219025942,8.184133252794958,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-09 12:00:00,10.55571650269678,7.635778078425459,10.43683108425784,7.847532146733346,8.100127641989639,7.770247510198059,8.040702032061867,True,True,True,True,True,True,True
2021-05-19 19:00:00,2.322496225799398,2.193219010982461,2.301622604435732,2.204278609893358,2.285408405883714,1.813280858368885,1.667207419773053,True,True,True,True,True,True,True
2021-05-30 12:30:00,5.776450801637788,8.488826231951345,10.98525552709715,7.03016556196849,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-24 14:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.93466266883504,17.04697174496121,17.0739475214739,False,False,False,False,True,False,True

What you are looking at:

"n" represents the number of measuring points.

  • First column: Timestamp of values
  • Columns index 1 to "n": Average windspeed at different points, of the last 10 minutes
  • Columns index "n+1" to last (-1): Qualifies if the value of the respective point is valid (True) or invalid (False). So to the value "1_wind", the qualifier "1_wind_Q" applies

Want I'm trying to achieve: The goal is to create a new column called "Avg_WS" which iterates through every row and calculates the following:

  • Average of the value ranges, ONLY if the corresponding Qualifier is TRUE

Example: So if in a given row, the column "4_wind_Q" is "False", the value "4_wind" should be excluded from the average on that given row.

Extra: If all Qualifiers are "False" in a given row, "Avg_WS" should equal to "NaN" in that same row.

I've tried to use apply, but I can't figure out how to match the pairs of value-qualifier

Thnak you so much in advanced!

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
DiegoRC
  • 13
  • 3
  • 2
    Hi! Can you copy and paste some rows of your dataframe directly into the question rather than giving us a file to download? [This question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) has good advice about how to ask a good pandas question – tomjn Jun 14 '21 at 16:25
  • 1
    Hi! @tomjn. I think I did it right, it's pasted as code, although it doesn't look nice. – DiegoRC Jun 14 '21 at 17:23

1 Answers1

0

I tried using mask for this.

quals = ['1_wind_Q','2_wind_Q','3_wind_Q','4_wind_Q','5_wind_Q','6_wind_Q','7_wind_Q']
fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']

df[fields].mask( ~df[quals].values ).mean( axis=1 )

# output

0     11.047089
1      8.178635
2      7.378650
3           NaN
4     12.254836
5     11.945236
6           NaN
7     17.500237
8     12.516802
9     10.119969
10     8.802471
11     8.626705
12     2.112502
13     8.070175
14    17.504305
dtype: float64

# assign this to the dataframe
df.loc[ :, 'Avg_WS' ] = df[fields].mask( ~df[quals].values ).mean( axis=1 )

mask works by essentially applying a boolean mask on each of the "fields" - the caveat is the bool mask must be the same shape as the data you are trying applying it on (i.e. must have same dimensions n x m)

mean( axis= 1 ) tells the data frame to apply the mean function across each row ( rather than column which axis=0 would imply.

Reuben
  • 68
  • 7
  • Thank you very much @Reuben ! This worked just fine, I didn't know about this command. Suddenly a whole new spectrum of possibilities just opened up. – DiegoRC Jun 15 '21 at 09:57