Trouble looping over a dataframe and generating summary statistics

Question

I have a dataframe that includes a variable (t_seg_size), and I want to split that variable into even segments (e.g. 0-1000000, 1000001-2000000, etc.) and then generate summary statistics for each segment.

The method I'm using is to iterate over the dataframe in chunks of the appropriate size, then generate the stats such as .std().

Here is the code:

for x in range (1000000, 200000000, 1000000):
    print(df3[(x-999999 < df3["t_seg_size"] < x)].t_seg_size.std())

So the loop should look for t_seg_size between (1) and (1000000) and generate the standard deviation. However, I receive the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-65-ee3e9911be81> in <module>()
      2 #df3[df3["t_seg_size"] > 2000000].describe()
      3 for x in range (1000000, 200000000, 1000000):
----> 4     print(df3[(1000000 < df3["t_seg_size"] < x)].t_seg_size.std())

C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
    696         raise ValueError("The truth value of a {0} is ambiguous. "
    697                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
--> 698                          .format(self.__class__.__name__))
    699 
    700     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any help would be greatly appreciated.

You want `print(df3[(df3["t_seg_size"] >= x-999999) & (df3["t_seg_size"] < x)].t_seg_size.std())`. The error is highlighting the fact that the truth value of a whole array is ambiguous, so to combine conditions on an array you should use the bitwise operators `&`, `|`, `~` for `and`, `or`, `not` respectively. Additionally, the conditions need parentheses due to operator precedence. — EdChum, May 20 '15 at 10:25
@EdChum many thanks for helping me again with my problems. If you have time, it would be great to know if there were any other ways of tackling this particular task - that is, splitting the dataframe according to certain ranges and generating stats. +I also want to plot the summary stats as a graph — JohnL_10, May 20 '15 at 10:33

EdChum · Accepted Answer · 2015-05-20T10:48:18.597

So the error in this line:

print(df3[(x-999999 < df3["t_seg_size"] < x)].t_seg_size.std())

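is raised because Python expands the chained comparison into (x-999999 < df3["t_seg_size"]) and (df3["t_seg_size"] < x), and the and operator needs a single True or False from each side. Comparing a scalar with a Series returns a whole Series of booleans, so its truth value is ambiguous: should it be True if there is a single match, or only if every element matches? To do this correctly, use the element-wise operators &, | and ~ in place of and, or and not, so the line becomes: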

print(df3[(df3["t_seg_size"] >= x-999999) & (df3["t_seg_size"] < x)].t_seg_size.std())

The parentheses are needed due to operator precedence.
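
As a minimal sketch of the point above, the snippet below uses a small made-up DataFrame (the values, and the names lo and hi, are illustrative and not from the question). It shows the chained comparison raising ValueError, the element-wise mask built with &, and Series.between as an alternative way to write the same range test (between includes both endpoints by default).

```python
import pandas as pd

# Hypothetical data standing in for the real df3.
df3 = pd.DataFrame({"t_seg_size": [5, 500_000, 999_999, 1_000_000, 1_500_000]})
s = df3["t_seg_size"]
lo, hi = 1, 1_000_000

# Python evaluates lo < s < hi as (lo < s) and (s < hi); the `and`
# asks for the truth value of a Series, which raises ValueError.
try:
    lo < s < hi
    chained_ok = True
except ValueError:
    chained_ok = False

# Element-wise version: each comparison yields a boolean Series,
# and & combines them row by row. The parentheses are required
# because & binds more tightly than >= and <.
mask = (s >= lo) & (s < hi)
selected = df3[mask]

# Same test via Series.between, which is inclusive at both ends,
# so the upper bound is hi - 1 for integer data.
mask_between = s.between(lo, hi - 1)
```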

To be honest, what you're doing looks OK to me. I'm not sure where you want to store the stats, but you could just append them to lists, create a Series/DataFrame from them and plot it:

stats = {'range': [], 'std': []}
for x in range(1000000, 200000000, 1000000):
    # compute the std for this segment once, then print and store it
    seg_std = df3[(df3["t_seg_size"] >= x-999999) & (df3["t_seg_size"] < x)].t_seg_size.std()
    print(seg_std)
    stats['range'].append(x)
    stats['std'].append(seg_std)

You should be able to plot this using pd.DataFrame(stats).plot().
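
A runnable sketch of the same idea follows, under some assumptions: the data is synthetic (the real df3 is not shown), only five segments are used, and the names step, rows and stats_df are made up for illustration. It sets the range as the index so that a later plot uses it as the x axis rather than drawing it as a second line.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df3; the real data is not shown in the question.
rng = np.random.default_rng(0)
df3 = pd.DataFrame({"t_seg_size": rng.integers(1, 5_000_000, size=1_000)})

step = 1_000_000
rows = []
for upper in range(step, 5 * step + 1, step):
    lower = upper - step + 1
    # Same mask as in the answer: lower <= t_seg_size < upper.
    in_segment = (df3["t_seg_size"] >= lower) & (df3["t_seg_size"] < upper)
    rows.append({"range": upper, "std": df3.loc[in_segment, "t_seg_size"].std()})

# One row per segment, indexed by the segment's upper bound.
stats_df = pd.DataFrame(rows).set_index("range")
```

Calling stats_df.plot() should then draw std against the upper bound of each range, provided matplotlib is installed.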

cool, but I get a syntax error for `stats={'range'=[], 'std'=[]}` `SyntaxError: invalid syntax` — JohnL_10, May 20 '15 at 10:46

score 0 · Answer 2 · edited May 23 '17 at 12:16


Your problem looks very similar to this one. Try numpy.logical_and; it should solve the issue.

import numpy as np

for x in range(1000000, 200000000, 1000000):
    print(df3[np.logical_and(df3["t_seg_size"] > x-999999, df3["t_seg_size"] < x)].t_seg_size.std())
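
As a small check, assuming a made-up four-row DataFrame, the sketch below shows that numpy.logical_and applied to two boolean Series gives the same element-wise result as the & operator. The names mask_np and mask_op are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real df3.
df3 = pd.DataFrame({"t_seg_size": [5, 500_000, 1_000_000, 1_500_000]})
s = df3["t_seg_size"]

# Function form: element-wise AND of two boolean Series.
mask_np = np.logical_and(s >= 1, s < 1_000_000)

# Operator form: the same element-wise AND written with &.
mask_op = (s >= 1) & (s < 1_000_000)

# Either mask can be used to select rows.
segment_std = df3[mask_np].t_seg_size.std()
```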

alec_djinn, answered May 20 '15 at 10:44

JoeCondron · Answer 3 · 2015-05-20T11:37:04.207

Here's a suggestion using groupby that should make it considerably faster:

 grouped = df.groupby((df.t_seg_size / 1000000).round())
 grouped.t_seg_size.std()

This will give you the standard deviation for each segment of the DataFrame in a fraction of the time. Another advantage is that you can call many other functions on grouped once the grouping is done, such as mean, median, etc. You can easily plot the result by calling .plot on it.
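
One thing to be aware of: round() assigns each value to the nearest multiple of 1000000, so those groups are centred on the multiples rather than starting at them. If the segments need to match the 1-1000000, 1000001-2000000 ranges from the question, one option is pandas.cut with explicit bin edges. The sketch below assumes synthetic data and five bins; the names edges, bins and summary are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"t_seg_size": rng.integers(1, 5_000_000, size=1_000)})

step = 1_000_000
edges = np.arange(0, 5 * step + 1, step)  # 0, 1000000, ..., 5000000

# pd.cut labels each value with its interval; by default the intervals
# are closed on the right: (0, 1000000], (1000000, 2000000], ...
bins = pd.cut(df["t_seg_size"], bins=edges)

# Group by interval and compute several statistics in one pass.
# observed=False keeps empty intervals in the result.
summary = df.groupby(bins, observed=False)["t_seg_size"].agg(["count", "mean", "std"])
```

summary has one row per interval, so summary["std"].plot() can be used to graph a single statistic if matplotlib is available.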
