1

enter image description here

I have a big dataset and for one of the columns that is dependent upon another column, I have to take the average as I go. For example. Row 1 average then when I go to row 2 I want the average of rows 1 and 2 and so on. Moreover, I have non-integers in my column and I only want to take the average of integers and I want it to skip those non-integers without affecting my average. (Include those rows but not include it in the average). I was thinking of writing a for loop but I got confused and don't know how to implement it. Any help would be greatly appreciated.

The picture shows part of the dataFrame. As you can see, the first column is based on Complete, NotComplete (and it's a daily compliance). I'm importing an Excel file and I try to modify it so it turns the complete/NotComplete into 100% and 0% respectively and add a new column to take the average. However, I want the average to be consistent and include every day and stops at the day that no Complete/NotComplete data is entered (That's the current day and the next day there will be data in there as people will complete/NotComplete the project). I need to keep track of completion of a task on a daily basis and finally, Add a last column indicating that if they go below 85% that they are "No Longer Compliant" but as long as they are above 85%, they are "Still Compliant". Hope this is clear enough.

  • 1
    When you say dataframe/dataset... is that a `pandas` DataFrame or something else? – Jon Clements Aug 25 '19 at 06:39
  • Yes Pandas dataframe – Michael Abbasi Aug 25 '19 at 06:58
  • 1
    Great... can you make some [edit]s to your post after having a read of https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples and then it'll also be worth adding the pandas tag to get the appropriate eyeballs on it. – Jon Clements Aug 25 '19 at 07:01

1 Answers1

1

I will try to make an example data frame, called df, and explain, what you can do.

>> import pandas as pd
>> df = pd.DataFrame(
    [
        {'Make': 'Skoda', 'Flag': 'no integer'}, 
        {'Make': 'Kia', 'Flag': 1},
        {'Make': 'BMW', 'Flag': 0}, 
        {'Make': 'Lexus', 'Flag': 1}, 
        {'Make': 'Mercedes', 'Flag': 0},
    ]
)
>> df = df[['Make', 'Flag']] # just changing columns order
>> print(df)
       Make        Flag
0     Skoda  no integer
1       Kia           1
2       BMW           0
3     Lexus           1
4  Mercedes           0

You can check the column types by the following:

>> df.dtypes
Make    object
Flag    object
dtype: object

I guess you have got the same column type in your column as I have got in my Flag column. However, I can make another column that would have only "integer" values and the rest would be NaNs. I can use isinstance() function to achieve that.

>> df['Flag_int'] = df['Flag'].apply(lambda flag: flag if isinstance(flag, int) else pd.np.NaN)
>> print(df)
       Make        Flag  Flag_int
0     Skoda  no integer       NaN
1       Kia           1       1.0
2       BMW           0       0.0
3     Lexus           1       1.0
4  Mercedes           0       0.0

Now, when you have your data frame ready, you can compute a mean values. In your case, you can use combination of expanding() and mean() methods. In my case, the result is in Flag_int_mean feature.

>> df['Flag_int_mean'] = df['Flag_int'].expanding().mean()
>> print(df)
       Make        Flag  Flag_int  Flag_int_mean
0     Skoda  no integer       NaN            NaN
1       Kia           1       1.0       1.000000
2       BMW           0       0.0       0.500000
3     Lexus           1       1.0       0.666667
4  Mercedes           0       0.0       0.500000

I believe it will be helpful.

Jaroslav Bezděk
  • 6,967
  • 6
  • 29
  • 46
  • Thank you for your help. I really appreciate it. However, the column I have only has 1 and 0 values and the rest of the data are not numbers (which will eventually turn into values as more data are added but for now I want the average to skip all the non-numbers and only include numbers. – Michael Abbasi Aug 25 '19 at 08:24
  • @MichaelAbbasi, please, read carefully answer to this [question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your question to avoid misunderstanding as ours was. Most importantly, please, add sample of your data and sample of result you want to achieve! – Jaroslav Bezděk Aug 25 '19 at 09:22
  • @MichaelAbbasi, I have just edited my answer. I hope it is what you want to achieve. If not, please, do not hesitate to say so. – Jaroslav Bezděk Aug 25 '19 at 09:39
  • Thank you so much for all your help. I've now added a picture and some more info. I really appreciate your time and help. – Michael Abbasi Aug 25 '19 at 19:00