
I can return the frequency of all columns in a nice dataframe with a total column.

for column in df:
    print(df.groupby(column).size().reset_index(name="total"))

Count   total
0   1   423
1   2   488
2   3   454
3   4   408
4   5   343

Precipitation   total
0   Fine        7490
1   Fog         23
2   Other       51
3   Raining     808

Month   total
0   1   717
1   2   648
2   3   710
3   4   701

I put the loop in a function, but this returns the first column "Count" only.

def count_all_columns_freq(dataframe_x):
    for column in dataframe_x:
        return dataframe_x.groupby(column).size().reset_index(name="total")

count_all_columns_freq(df)

Count   total
0   1   423
1   2   488
2   3   454
3   4   408
4   5   343

Is there a way to do this using slicing or another method, e.g. `for column in dataframe_x[1:]:`?

Edison
  • I'm not sure I understood your code, but if you need to count values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html – janluke Dec 18 '20 at 13:23
  • See edit. I can already count as coded in my question. I want to be able to customise the column selection via slicing or other method. Thank you. – Edison Dec 18 '20 at 13:25
  • Please include sample(s) in text form that we can copy and paste for helping you, we don't want to type everything in... check out [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – MarianD Dec 18 '20 at 13:27
  • Still not clear. Write the dataframe you want in output. Btw, I think value_count is all you need. – janluke Dec 18 '20 at 13:33
  • Well I want all frequency counts in their own dataframe like I already did above. You can see the three dataframes **Count**, **Precipitation** and **Month**. I tried `value_count` but `groupby` with the `size` and `reset_index` gives me a better dataframe. – Edison Dec 18 '20 at 13:36
  • To get the columns you need `df.columns`, do you want to put a conditions on which columns are selected? – Nathan Furnal Dec 18 '20 at 14:35
  • So you just want to return a list of dataframes. – janluke Dec 18 '20 at 14:51

2 Answers


Based on your comment, you just want to return a list of dataframes:

def count_all_columns_freq(df):
    return [df.groupby(column).size().reset_index(name="total")
            for column in df]

You can select columns in many ways in pandas, e.g. by slicing or by passing a list of columns like in df[['colA', 'colB']]. You don't need to change the function for that.
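For instance, both label-based selection and positional slicing can be passed straight to the function. A minimal sketch with a made-up frame (the column names here are illustrative, borrowed from your question):

```python
import pandas as pd

def count_all_columns_freq(df):
    # one frequency table (value -> total) per column
    return [df.groupby(column).size().reset_index(name="total")
            for column in df]

df = pd.DataFrame({"Count": [1, 1, 2],
                   "Month": [1, 2, 2],
                   "Precipitation": ["Fine", "Fog", "Fine"]})

# label-based subset of columns
tables = count_all_columns_freq(df[["Count", "Precipitation"]])

# positional column slice: all rows, columns from index 1 onward
tables = count_all_columns_freq(df.iloc[:, 1:])
```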

Personally, I would return a dictionary instead:

def frequency_dict(df):
    return {column: df.groupby(column).size()
            for column in df}

# so that I could use it like this:
freq = frequency_dict(df)
freq['someColumn'].loc[value]

EDIT: "What if I want to count the number of NaN?"

In that case, you can pass dropna=False to groupby (this works for pandas >= 1.1.0):

def count_all_columns_freq(df):
    return [df.groupby(column, dropna=False).size().reset_index(name="total")
            for column in df]
janluke
  • I upvoted because I was able to call the function and select two columns with `count_all_columns_freq(df[['Count','Precipitation']])` as advertised. However, slicing produced errors. Could you demonstrate how to implement slicing in the function to select columns? I tried `count_all_columns_freq(df[:3])`, but that limited the selection of the column values, not the columns themselves. – Edison Dec 19 '20 at 04:44
  • The other issue is that it does not count NaNs. Perhaps `unique()` could be incorporated somehow. – Edison Dec 19 '20 at 10:09
  • @Edison You should read the pandas tutorial before coming here asking people. You lack a basic knowledge of pandas that is learnable reading the "10 minutes to pandas" tutorial. It'd be far more efficient for you if you read at least the basics before trying stuff of start a project. Coming to you question, `df[:3]` selects the first three rows. Read [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection) to learn how to select a subset of column by labels or indexes. Or even better, read the full "10 minutes to pandas". – janluke Dec 19 '20 at 11:52
  • If you want to count NaNs, you need to pass `dropna=False` to `groupby`. I'll update my answer. – janluke Dec 19 '20 at 11:58
  • `TypeError: groupby() got an unexpected keyword argument 'dropna'` – Edison Dec 20 '20 at 06:58
  • You need pandas>=1.1.0 for dropna. You are using an older version. – janluke Dec 20 '20 at 12:00
  • Out of curiosity, why did you say you prefer returning dictionary? It seems very limited in utility. All it can do is return a single column and a single value. Of course it allows you to narrow in on an exact value of a column, but you get that for free if you return the entire column which is a more common use case. It's during the overview of a column's values where you might make missing value discoveries. – Edison Dec 21 '20 at 02:58
  • The only difference between a dict and a list is that with the dict you can refer to items by name rather than by index. This makes the code more readable. In my function, I preferred to return a Series, because I see no value in returning a DataFrame with a "total" column. With my version, you can access the distinct values with `freq['col'].index` and the corresponding frequencies with `freq['col']`. With the list version, you access the counted values with `freq[i][col]` and the corresponding counts with `freq[i][col].total`. – janluke Dec 21 '20 at 13:21
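To make the row-versus-column slicing point from the comments concrete, here is a small sketch on a made-up frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Count": [1, 2, 3],
                   "Precipitation": ["Fine", "Fog", "Fine"],
                   "Month": [1, 2, 3]})

df[:2]                      # slices ROWS: first two rows, all columns
df.iloc[:, :2]              # slices COLUMNS by position: Count, Precipitation
df.loc[:, "Count":"Month"]  # slices columns by label (inclusive on both ends)
```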

You can create a dataframe from the group-by sizes with `concat` and a bit of renaming.

First get the columns you want, for example :

cols = df.columns 

Then use `concat` to patch them together, passing the columns as the `keys` (the new outer index level) and naming the index levels "indices" and "groups", which is how they will be displayed.

res = pd.concat((df.groupby(col, dropna=False).size() for col in cols),
                keys=cols, names=["indices", "groups"])

Now, we want this in a dataframe, not a series.

res = pd.DataFrame(res)

Finally, we rename the totals,

res = res.rename(columns={0 : "totals"})

Example :

import pandas as pd
import numpy as np
rng = np.random.default_rng() # random number generation

A = rng.choice(["a", "b", "c"], 50)
B = rng.choice(["e", "f", "d"], 50)
C = rng.choice(['1', '2', '3', '5', '11'], 50)

df = pd.DataFrame({"A":A, "B":B, "C":C})

cols = df.columns
res = pd.DataFrame(pd.concat((df.groupby(c, dropna=False).size() for c in cols),  
                             keys=cols, names=["indices", "groups"]))

res = res.rename(columns = {0 : "totals"})

Outputs :

              totals
indices groups        
 A      a          16
        b          17
        c          17
 B      d           9
        e          22
        f          19
 C      1          10
        11         16
        2           8
        3          10
        5           6

Creating the relevant function can be done as such :

def concat_groups(df, cols=None):
    if cols is None:
        cols = df.columns

    res = pd.DataFrame(pd.concat((df.groupby(c, dropna=False).size() for c in cols),  
                                keys=cols, names=["indices","groups"]))
    
    res = res.rename(columns = {0 : "totals"})

    return res

So you can either pass a dataframe and a list of selected columns, or pass a dataframe containing only the relevant columns.
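Usage then looks like this (function repeated for completeness, data made up for illustration; `dropna=False` needs pandas >= 1.1.0):

```python
import pandas as pd

def concat_groups(df, cols=None):
    # as defined above: one MultiIndexed frame of group sizes
    if cols is None:
        cols = df.columns
    res = pd.DataFrame(pd.concat((df.groupby(c, dropna=False).size() for c in cols),
                                 keys=cols, names=["indices", "groups"]))
    return res.rename(columns={0: "totals"})

df = pd.DataFrame({"A": ["a", "b", "a"],
                   "B": ["e", "e", "f"],
                   "C": ["1", "2", "1"]})

res = concat_groups(df, cols=["A", "B"])  # explicit column list
res2 = concat_groups(df[["A", "C"]])      # or pre-select the columns
```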

Cheers

Nathan Furnal
  • This works extremely well. And it displays the nice looking dataframes as well not just the array output. I did the same above so I was hoping to keep the dataframes. One request. This function does not count NaNs. Is there a way to incorporate pandas `unique()`? – Edison Dec 19 '20 at 09:29
  • Ah I see. By default the column that belongs to the size is either blank or 0 in pandas. That's why you have to rename it. `{0 : "totals"}` is just the syntax for the rename method which is similar to the syntax for the `replace()` method. – Edison Dec 19 '20 at 10:41
  • Yes the syntax is the same, pandas dataframes work well with dictionaries to remap columns or values for examples. – Nathan Furnal Dec 19 '20 at 10:57
  • As for your other question, it's difficult to include NaNs: you have to come up with a scheme to replace them and then group on them. You can code them with a value that you don't use otherwise. – Nathan Furnal Dec 19 '20 at 10:59
  • You could do `df.fillna('None')` for missing values in string columns and `df.fillna(-999)` for missing values in the number columns. Do note that you'll need to select the proper columns beforehand. – Nathan Furnal Dec 19 '20 at 11:23
  • @Edison or add `dropna=False` in the groupby call, this works with pandas >= 1.1.0 – Nathan Furnal Dec 20 '20 at 14:22