Iterate pandas data frame for rows consists of arrays and compute a moving average based on condition

Question

I can't figure out a problem I am trying to solve. I have a pandas data frame coming from this:

date,       id,     measure,    result
2016-07-11, 31, "[2, 5, 3, 3]",     1
2016-07-12, 32, "[3, 5, 3, 3]",     1
2016-07-13, 33, "[2, 1, 2, 2]",     1
2016-07-14, 34, "[2, 6, 3, 3]",     1
2016-07-15, 35, "[39, 31, 73, 34]", 0
2016-07-16, 36, "[3, 2, 3, 3]",     1
2016-07-17, 37, "[3, 8, 3, 3]",     1

Measurements column consists of arrays in string format.

I want to have a new moving-average-array column from the past 3 measurement records, excluding those records where the result is 0. Past 3 records mean that for id 34, the arrays of id 31,32,33 to be used.

It is about taking the average of every 1st point, 2nd point, 3rd and 4th point to have this moving-average-array.

It is not about getting the average of 1st array, 2nd array ... and then averaging the average, no.

For the first 3 rows, because there is not enough history, I just want to use their own measurement. So the solution should look like this:

date,       id,     measure,    result .     Solution
2016-07-11, 31, "[2, 5, 3, 3]",     1,      "[2,   5, 3,   3]"
2016-07-12, 32, "[3, 5, 3, 3]",     1,      "[3,   5, 3,   3]"
2016-07-13, 33, "[2, 1, 2, 2]",     1,      "[2,   1, 2,   2]"
2016-07-14, 34, "[2, 6, 3, 3]",     1,      "[2.3, 3.6, 2.6, 2.6]"
2016-07-15, 35, "[39, 31, 73, 34]", 0,      "[2.3, 4, 2.6, 2.6]"
2016-07-16, 36, "[3, 2, 3, 3]",     1,      "[2.3, 4, 2.6, 2.6]"
2016-07-17, 37, "[3, 8, 3, 3]",     1,      "[2.3, 3, 2.6, 2.6]"

The real data is bigger. result 0 may repeat 2 or more times after each other also. I think it will be about keeping a track of previous OK results properly getting those averages. I spent time but I could not.

I am posting the dataframe here:

 mydict = {'date': {0: '2016-07-11',
      1: '2016-07-12',
      2: '2016-07-13',
      3: '2016-07-14',
      4: '2016-07-15',
      5: '2016-07-16',
      6: '2016-07-17'},
     'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
     'measure': {0: '[2, 5, 3, 3]',
      1: '[3, 5, 3, 3]',
      2: '[2, 1, 2, 2]',
      3: '[2, 6, 3, 3]',
      4: '[39, 31, 73, 34]',
      5: '[3, 2, 3, 3]',
      6: '[3, 8, 3, 3]'},
     'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}

df = pd.DataFrame(mydict)

Thank you for giving directions or pointing out how to.

your measure is a str or a list of int, double quotes indicates str? and you give the possibility to delete row with 0 as result? — Frenchy, Feb 23 '19 at 09:44
It is a string which you can parse and generate a list of numbers. Originally, they are floats like 3.34, 2.45. I typed in integers there for simplicity. We are not allowed to delete the rows with 0 results, we need them. — Silvana, Feb 23 '19 at 09:54

Daniel Labbe · Accepted Answer · 2019-02-24T08:46:51.960

1

Solution using only 1 for loop:

Considering the data:

mydict = {'date': {0: '2016-07-11',
      1: '2016-07-12',
      2: '2016-07-13',
      3: '2016-07-14',
      4: '2016-07-15',
      5: '2016-07-16',
      6: '2016-07-17'},
     'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
     'measure': {0: '[2, 5, 3, 3]',
      1: '[3, 5, 3, 3]',
      2: '[2, 1, 2, 2]',
      3: '[2, 6, 3, 3]',
      4: '[39, 31, 73, 34]',
      5: '[3, 2, 3, 3]',
      6: '[3, 8, 3, 3]'},
     'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)

I defined a simple function to calculate the means and return a list. Then, loop the dataframe applying the rules:

def calc_mean(in_list):
    p0 = round((in_list[0][0] + in_list[1][0] + in_list[2][0])/3,1)
    p1 = round((in_list[0][1] + in_list[1][1] + in_list[2][1])/3,1)
    p2 = round((in_list[0][2] + in_list[1][2] + in_list[2][2])/3,1)
    p3 = round((in_list[0][3] + in_list[1][3] + in_list[2][3])/3,1)
    return [p0, p1, p2, p3]

Solution = []
aux_list = []
for index, row in df.iterrows():
    if index in [0,1,2]:
        Solution.append(row.measure)
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
    else:
        Solution.append('[' +', '.join(map(str, calc_mean(aux_list))) + ']')
        if row.result > 0:
            aux_list.pop(0)
            aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])        
df['Solution'] = Solution

The output is:

Please note that the result is rounded to 1 decimal place, a bit different from your desired output. Made more sense to me.

EDIT:

As a suggestion in the comments by @Frenchy, to deal with result == 0 in the first 3 rows, we need to change a bit the first if clause:

if index in [0,1,2] or len(aux_list) <3:
    Solution.append(row.measure)
    if row.result > 0:
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])

edited Feb 24 '19 at 08:46

answered Feb 23 '19 at 11:16

Daniel Labbe

1,979
3
15
20

Thank you Daniel. This was really helpful. In fact my data is huge and very dirty. But your help cleared my way. – Silvana Feb 23 '19 at 21:17
Hi @Silvana , good to know that it was helpful. Could you please upvote and accept my answer if you think that it deserves it? – Daniel Labbe Feb 23 '19 at 21:32
Hi @daniel. I upvoted but the website says that it won't be visible because I don't have any reputation. To have reputation I need some upvotes they say. So, if you think that this question is a useful one for the community could you upvote the question? – Silvana Feb 24 '19 at 07:59
Thanks anyway. But I think even without reputation you are able to accept the answer... – Daniel Labbe Feb 24 '19 at 08:01
I really can not see anything like `accept the answer` here. I will keep checking and `accept` if the website lets me. Thank you. – Silvana Feb 24 '19 at 08:26
Thanks! :) I know it sounds needy to ask for this, but at the end of the day the reputation matters. – Daniel Labbe Feb 24 '19 at 08:34
the answer is smart, just one thing, a problem could exist if you have a 0(in Result) in one of 3 first row.. – Frenchy Feb 24 '19 at 08:38
You are correct, @Frenchy. Made an edit. Thanks for pointing this out. – Daniel Labbe Feb 24 '19 at 08:43

Ben.T · Answer 2 · 2019-02-23T12:23:58.163

1

You can use pd.eval to change from a str of list to a proper list only the part of the data in measure where result is not 0. Use rolling with mean and then shift to get the rolling average over the last 3 rows at the next row. Then map to str once your dataframe is changed to a list of list with values and tolist. Finally you just need to replace the first three rows and ffill the missing data:

df.loc[df.result.shift() != 0,'solution'] = list(map(str,
                              pd.DataFrame(pd.eval(df[df.result != 0].measure))
                                .rolling(3).mean().shift().values.tolist()))
df.loc[:2,'solution'] = df.loc[:2,'measure']
df.solution = df.solution.ffill()

edited Feb 23 '19 at 12:23

answered Feb 23 '19 at 11:51

Ben.T

29,160
6
32
54

Hi @Ben.T. Thank you. This really helped me a lot. Your approach is quite clean, works nicely. – Silvana Feb 24 '19 at 09:16
1

The only problem I had with your solution is coming from a reason related to this entry. `eval` is not working on my large data: https://stackoverflow.com/questions/48008191/attributeerror-pandasexprvisitor-object-has-no-attribute-visit-ellipsis-us – Silvana Feb 24 '19 at 10:20
@Silvana thanks for pointing this out. I guess then using one of the solution of the link you gave to replace `eval` could do it for larger data :) – Ben.T Feb 24 '19 at 16:02

score 1 · Answer 3 · answered Feb 23 '19 at 19:49

Here's another solution:

# get data to reproduce example
from io import StringIO
data = StringIO(""" 
    date;id;measure;result 
    2016-07-11;31;"[2,5,3,3]";1 
    2016-07-12;32;"[3,5,3,3]";1 
    2016-07-13;33;"[2,1,2,2]";1 
    2016-07-14;34;"[2,6,3,3]";1 
    2016-07-15;35;"[39,31,73,34]";0 
    2016-07-16;36;"[3,2,3,3]";1 
    2016-07-17;37;"[3,8,3,3]";1 
    """)  

df = pd.read_csv(data, sep=";")
df
# Out:
#          date  id        measure  result
# 0  2016-07-11  31      [2,5,3,3]       1
# 1  2016-07-12  32      [3,5,3,3]       1
# 2  2016-07-13  33      [2,1,2,2]       1
# 3  2016-07-14  34      [2,6,3,3]       1
# 4  2016-07-15  35  [39,31,73,34]       0
# 5  2016-07-16  36      [3,2,3,3]       1
# 6  2016-07-17  37      [3,8,3,3]       1  

# convert values in measure column to lists
from ast import literal_eval
dm = df['measure'].apply(literal_eval)

# apply rolling mean with period 2 and recollect values into list in column means
df["means"] = dm.apply(pd.Series).rolling(2, min_periods=0).mean().values.tolist()                            

df                                                                                                           
# Out: 
#          date  id        measure  result                     means
# 0  2016-07-11  31      [2,5,3,3]       1      [2.0, 5.0, 3.0, 3.0]
# 1  2016-07-12  32      [3,5,3,3]       1      [2.5, 5.0, 3.0, 3.0]
# 2  2016-07-13  33      [2,1,2,2]       1      [2.5, 3.0, 2.5, 2.5]
# 3  2016-07-14  34      [2,6,3,3]       1      [2.0, 3.5, 2.5, 2.5]
# 4  2016-07-15  35  [39,31,73,34]       0  [20.5, 18.5, 38.0, 18.5]
# 5  2016-07-16  36      [3,2,3,3]       1  [21.0, 16.5, 38.0, 18.5]
# 6  2016-07-17  37      [3,8,3,3]       1      [3.0, 5.0, 3.0, 3.0]

# moving window of size 3
df["means"] = dm.apply(pd.Series).rolling(3, min_periods=0).mean().round(2).values.tolist()
df
# Out: 
#             date  id        measure  result                        means
# 0  2016-07-11  31      [2,5,3,3]       1         [2.0, 5.0, 3.0, 3.0]
# 1  2016-07-12  32      [3,5,3,3]       1         [2.5, 5.0, 3.0, 3.0]
# 2  2016-07-13  33      [2,1,2,2]       1     [2.33, 3.67, 2.67, 2.67]
# 3  2016-07-14  34      [2,6,3,3]       1      [2.33, 4.0, 2.67, 2.67]
# 4  2016-07-15  35  [39,31,73,34]       0   [14.33, 12.67, 26.0, 13.0]
# 5  2016-07-16  36      [3,2,3,3]       1  [14.67, 13.0, 26.33, 13.33]
# 6  2016-07-17  37      [3,8,3,3]       1  [15.0, 13.67, 26.33, 13.33]

Thank you for your response. But, you are including those big measurement numbers like [39,31,73,34] into your computation for finding the average. That's the problem with your solution. Look how big your averages are. It is because you included the bad results. I hope this makes it clear. Thank you. — Silvana, Feb 24 '19 at 09:36

Iterate pandas data frame for rows consists of arrays and compute a moving average based on condition

3 Answers3