0

I have a dataframe for which I'm looking at histograms of subsets of the data using column and by of pandas' hist() method, as in:

ax = df.hist(column='activity_count', by='activity_month')

(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data: the number of records with that particular value of 'activity_month' as well as the value of 'activity_month' when I loop over the axes:

for i,x in enumerate(ax):`  
   print("the value of a is", a)
   print("the number of rows with value of a", b)

so that I'd get:

January 1002
February 4305
etc

Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have a given value of activity_month equal to that,

a="January"
len(df[df["activity_month"]=a])

but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?

ImportanceOfBeingErnest
  • 321,279
  • 53
  • 665
  • 712
JCQ
  • 623
  • 1
  • 5
  • 8

1 Answers1

0

Here is a short example dataframe:

import pandas as pd

df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
                    ['November',4],['February',98],['January',44],['October',47],['January',4],
                    ['April',8],['March',21],['April',41],['June',34],['March',63]],
                    columns=['activity_month','activity_count'])

Yields:

   activity_month  activity_count
0         January              19
1           March               6
2         January              24
3        November              83
4        February              23
5        November               4
6        February              98
7         January              44
8         October              47
9         January               4
10          April               8
11          March              21
12          April              41
13           June              34
14          March              63

If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:

df.groupby('activity_month')['activity_count'].sum()

Gives:

activity_month
April        49
February    121
January      91
June         34
March        90
November     87
October      47
Name: activity_count, dtype: int64

To get the number of rows that correspond to a given group:

df.groupby('activity_month')['activity_count'].agg('count')

Gives:

activity_month
April       2
February    2
January     4
June        1
March       3
November    2
October     1
Name: activity_count, dtype: int64

After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.

rahlf23
  • 8,869
  • 4
  • 24
  • 54
  • thank you so much for taking the time to answer! that's not quite the question I'd asked, though (sorry!). I was asking how to get the info *within the context of the loop* over the enumerable. Your example DF is perfect, and the Month and counts output are correct given that DF. But the mechanism by which you got there is different than the question i'm asking – JCQ Aug 15 '18 at 00:30
  • Downvote is hardly necessary, I can edit my response. – rahlf23 Aug 15 '18 at 03:17
  • That was my fat fingers hitting the wrong icon :( . catch-22 though: it says my vote is locked unless there's an edit – JCQ Aug 15 '18 at 14:36
  • @JCQ see my edit. You should really not be looping through the subplot axes to query the data, this is what the dataframe is for. Not to mention that explicitly looping through subplot axes is an extremely inefficient manner to query for data you have already plotted. I would recommend storing this information prior to plotting if you intend to use it in conjunction with the plots themselves. – rahlf23 Aug 15 '18 at 15:01
  • 1
    here's my use-case argument: there's a histogram for each month. For each histogram I want the title to contain the month itself, ex: "January has 1,502 activities". I can do this as separate histograms with separate calls to `.hist()` for a given month. But if I use `hist()` arguments of column & by, it seems odd that I can get all the histograms "at once" but I can't (or ought not) customize the title. I can determine the color of the bars and the tickmarks and etc but I can't change the title to include a variable that is fixed for a given iteration over the grouped-bys ? Odd. – JCQ Aug 15 '18 at 19:57
  • I see, thank you for the clarification. Could this be what you're after? https://stackoverflow.com/q/8938449/8146556 – rahlf23 Aug 15 '18 at 20:01