1

I have a dataframe that has dtype=object, i.e. categorical variables, for which I'd like to have the counts of each level of. I'd like the result to be a pretty summary of all categorical variables.

To achieve the aforementioned goals, I tried the following:

(line 1) grab the names of all object-type variables

(line 2) count the number of observations for each level (a, b of v1)

(line 3) rename the column so it reads "count"

stringCol = list(df.select_dtypes(include=['object'])) # list object of categorical variables
a = df.groupby(stringCol[0]).agg({stringCol[0]: 'count'})
a = a.rename(index=str, columns={stringCol[0]: 'count'}); a
    count
v1  
a   1279
b   2382

I'm not sure how to elegantly get the following result where all string column counts are printed. Like so (only v1 and v4 shown, but should be able to print such results for a variable number of columns):

    count       count
v1           v4
a   1279     l  32
b   2382     u  3055
             y  549

The way I can think of doing it is:

  1. select one element of stringCol
  2. calculate the count of for each group of the column.
  3. store the result in a Pandas dataframe.
  4. store the Pandas dataframe in an object (list?)
  5. repeat
  6. if last element of stringCol is done, break.

but there must be a better way than that, just not sure how to do it.

user2205916
  • 3,196
  • 11
  • 54
  • 82

1 Answers1

1

I think simpliest is use loop:

df = pd.DataFrame({'A':list('abaaee'),
                   'B':list('abbccf'),
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aacbbb')})

print (df)
   A  B  C  D  E  F
0  a  a  7  1  5  a
1  b  b  8  3  3  a
2  a  b  9  5  6  c
3  a  c  4  7  9  b
4  e  c  2  1  2  b
5  e  f  3  0  4  b

stringCol = list(df.select_dtypes(include=['object']))

for c in stringCol:
    a = df[c].value_counts().rename_axis(c).to_frame('count')
    #alternative
    #a = df.groupby(c)[c].count().to_frame('count')
    print (a)

   count
A       
a      3
e      2
b      1
   count
B       
b      2
c      2
a      1
f      1
   count
F       
b      3
a      2
c      1

For list of DataFrames use list comprehension:

dfs = [df[c].value_counts().rename_axis(c).to_frame('count') for c in stringCol]
print (dfs)

[   count
A       
a      3
e      2
b      1,    count
B       
b      2
c      2
a      1
f      1,    count
F       
b      3
a      2
c      1]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • yeah, this is much better. didn't know about those functions and hard to learn about their existence through google w/o knowing relevant key terms. is there a way to make them print side-by-side to avoid having to scroll through the results? relevant bc may have a number of columns on the order of 1x10^n where n>=2 – user2205916 Jun 30 '18 at 04:53
  • @user2205916 - Yes, but it is a bit complicated, check [this](https://stackoverflow.com/a/38801975/2901002) – jezrael Jun 30 '18 at 04:54
  • in order to use the solution in the link you posted, i'd have to create a dataframe for each iteration through the loop. then, i think i'd have to store each dataframe into another object (which would be?) – user2205916 Jun 30 '18 at 04:57
  • Yes, it is possible create list of DataFrames. Give me a sec – jezrael Jun 30 '18 at 04:58
  • thanks. almost done. the final solution will look like: `display_side_by_side(dfs[0], dfs[1], dfs[2])` what is a better way to enumerate all of the dataframes in `dfs`? i tried: `display_side_by_side([dfs[c] for c in range(len(dfs))])` but that won't work – user2205916 Jun 30 '18 at 05:15
  • OK, I am going test it. – jezrael Jun 30 '18 at 05:16
  • that is a great trick. is there a term or reference i can look up for more information on it? not sure what search terms to use – user2205916 Jun 30 '18 at 05:22
  • 1
    You can check [this](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists) – jezrael Jun 30 '18 at 05:26
  • Last question: how to print with thousands separator? I found: https://stackoverflow.com/questions/1823058/how-to-print-number-with-commas-as-thousands-separators but can't get it to work with: `stringCol = list(df.select_dtypes(include=['object'])) # list object of categorical variables dfs = [df[c].value_counts().rename_axis(c).to_frame('count') for c in stringCol]` `display_side_by_side(*dfs)`. – user2205916 Jul 01 '18 at 02:12
  • You can use `dfs = [df[c].value_counts().rename_axis(c).to_frame('count').applymap("{:,}".format) for c in stringCol] ` – jezrael Jul 01 '18 at 04:49
  • great. how do i round decimal places for: `[df[c].value_counts(normalize=True).rename_axis(c).to_frame('count').applymap("{:,}".format) for c in stringCol]` ? I tried adding `.applymap(lambda x: round(x,1))` but it takes issue with the `value_counts` – user2205916 Jul 01 '18 at 05:09
  • @user2205916 - Hmmm, it seems round is not necessay, because count values, so nevet float, alwys get integer. Possible minimal value is 0, then, 1, 2 to N. But if need it join with format `dfs = [df[c].value_counts().rename_axis(c).to_frame('count').applymap(lambda x: "{:,}".format(round(x, 1))) for c in stringCol]` – jezrael Jul 01 '18 at 05:13
  • 1
    thx, you've been a great help perfecting this. `.value_counts(normalize=True)` actually produces floats because it gives proportions. i.e. 0.9 values are "true" and 0.1 values are "false" – user2205916 Jul 01 '18 at 05:28