I'm just trying to get a count of rows for each value in a given column, for example:

CSV Data:

'Occupation','data'
'Carpenter','data1'
'Carpenter','data2'
'Carpenter','data3'
'Painter','data1'
'Painter','data2'
'Programmer','data1'
'Programmer','data2'
'Programmer','data3'
'Programmer','data4'

Program:

import pandas as pd

filename = "./data/TestGroup.csv"

df = pd.read_csv(filename)
print(df.head())

print("Computing stats by HandRank... ")
df_stats = df[['data']].groupby(['Occupation']).agg(['count'])
# also tried:  df_stats = df[['Occupation']].groupby(['Occupation']).agg(['count'])
print(df_stats.head())

How can I get the count into a variable? Do .groupby and .agg return another DataFrame?

Output/Error:

  'Occupation'   'data'
0  'Carpenter'  'data1'
1  'Carpenter'  'data2'
2  'Carpenter'  'data3'
3    'Painter'  'data1'
4    'Painter'  'data2'
    Computing stats by HandRank... 
    Traceback (most recent call last):
      File "C:\Apps\PokerHandGenerator_Copy_not_Source\Server\TestPandasGroupBy.py", line 17, in <module>
        df_stats = df.groupby(['Occupation']).agg(['count'])
      File "C:\Apps\ProcessData\venv\lib\site-packages\pandas\core\frame.py", line 6714, in groupby
        return DataFrameGroupBy(
      File "C:\Apps\ProcessData\venv\lib\site-packages\pandas\core\groupby\groupby.py", line 560, in __init__
        grouper, exclusions, obj = get_grouper(
      File "C:\Apps\ProcessData\venv\lib\site-packages\pandas\core\groupby\grouper.py", line 811, in get_grouper
        raise KeyError(gpr)
    KeyError: 'Occupation'

The df.head() shows it is using "Occupation" as my column name.

NealWalters
  • More of the same: `KeyError: ('Occupation', 'data')`. Do I need two brackets or one? I tried it both ways. What are we specifying before the `.groupby`? Why not just `df.groupby(...)`? – NealWalters Feb 24 '21 at 04:09
  • Let's try `df[['Occupation', 'data']].groupby(['Occupation']).agg(['count'])`. The rationale is that we select a list of columns and then group them. `df[['data']]` returns a one-column DataFrame that no longer contains the Occupation column, so you are grouping by a column that does not exist, if you know what I mean. – wwnde Feb 24 '21 at 04:09
  • Try running `df[['Occupation', 'data']]` versus `df[['data']]`. The first keeps both columns; the second is a single-column DataFrame without `Occupation`. – wwnde Feb 24 '21 at 04:10
  • KeyError: "None of [Index(['Occupation', 'data'], dtype='object')] are in the [columns]" – NealWalters Feb 24 '21 at 04:12
  • I am lost as to what you are trying. – wwnde Feb 24 '21 at 04:13
  • You can try `df[['Occupation','data']].groupby(['Occupation'])['data'].count()` or `df[['Occupation','data']].groupby(['Occupation'])['data'].count().to_frame('data_count')` – wwnde Feb 24 '21 at 04:16
  • Got solution from Anurag Dabas below. – NealWalters Feb 24 '21 at 04:17
  • Cool, all the best – wwnde Feb 24 '21 at 04:18
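
For reference, here is a small sketch of what the two selections discussed in the comments return. It assumes a hypothetical in-memory frame whose column names are plain `Occupation` and `data` (no embedded quotes), which is not the case for the CSV in the question.

```python
import pandas as pd

# Hypothetical frame with clean column names (an assumption, not the asker's file).
df = pd.DataFrame({
    "Occupation": ["Carpenter", "Carpenter", "Painter"],
    "data": ["data1", "data2", "data1"],
})

one_col = df[["data"]]  # double brackets: a DataFrame with a single column
series = df["data"]     # single brackets: a Series

print(type(one_col).__name__, list(one_col.columns))
print(type(series).__name__)

# Grouping first and selecting the column afterwards keeps the key available.
counts = df.groupby("Occupation")["data"].count()
print(counts)
```

Because `one_col` no longer has an `Occupation` column, grouping it by that name fails with a `KeyError`, whatever the bracket style.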

1 Answer

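Pandas reads the first column name as 'Occupation', with the single quotes included, not as Occupation. By default read_csv only treats double quotes as the quote character, so the single quotes in the file become part of the names and values.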

Use this:

df_stats = df.groupby("'Occupation'").agg(['count'])

instead of this:

df_stats = df[['data']].groupby(['Occupation']).agg(['count'])
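
To see why the quoted name is needed, here is a minimal sketch that loads the question's sample data from a string (using `io.StringIO` instead of the asker's file path, which is an assumption made for self-containment) and inspects the column names.

```python
import io
import pandas as pd

# Sample data from the question, with single-quoted fields.
csv_text = """'Occupation','data'
'Carpenter','data1'
'Carpenter','data2'
'Carpenter','data3'
'Painter','data1'
'Painter','data2'
'Programmer','data1'
'Programmer','data2'
'Programmer','data3'
'Programmer','data4'
"""

df = pd.read_csv(io.StringIO(csv_text))

# The single quotes are part of each column name.
print(df.columns.tolist())

# Grouping by the literal name, quotes included, works.
df_stats = df.groupby("'Occupation'").agg(['count'])
print(df_stats)
```

Printing `df.columns.tolist()` is a quick check whenever a `KeyError` appears for a column that seems to exist in `df.head()`.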
NealWalters
Anurag Dabas
  • df_stats = df.groupby('Occupation').agg(['count']) gives same error: KeyError: 'Occupation' – NealWalters Feb 24 '21 at 04:10
  • Oh, I see the problem you are facing now. Check my solution again; I edited it. – Anurag Dabas Feb 24 '21 at 04:14
  • Thanks, yes: `df_stats = df.groupby("'Occupation'").agg(['count'])` worked! `print("Version of Pandas:", pd.__version__)` gives 1.2.1. I'm running PyCharm under Anaconda for this program. Probably time to update the library too! So what did that extra quote accomplish, `groupby("'Occupation'")` instead of `groupby('Occupation')`? – NealWalters Feb 24 '21 at 04:16
  • Yes, that is because your column name is wrapped in single quotes (`'`), so the column is `'Occupation'`, not `Occupation`. That's why `groupby("'Occupation'")` worked and `groupby('Occupation')` did not. – Anurag Dabas Feb 24 '21 at 04:19
  • By the way, there is no need to update `pandas`, though you can if you want to. I am using `pandas 1.1.4` and am not currently facing any problems. – Anurag Dabas Feb 24 '21 at 04:22
  • Ohhh! Maybe using quotechar="'" would fix it by creating column names without the quotes. Will try that later. – NealWalters Feb 24 '21 at 04:25
  • You can also rename your columns with `df.columns = ['Occupation', 'data']` or with the `df.rename()` method. – Anurag Dabas Feb 24 '21 at 04:28
  • My real-world data has lots of columns, and I'm getting the same count over and over for each data column. Is there a way to just get the row count, instead of a count of each column? I could do iterrows and pick them out and only show what I want... – NealWalters Feb 24 '21 at 04:30
  • Have a look at https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe – Anurag Dabas Feb 24 '21 at 04:34
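
The `quotechar` idea and the row-count question from the comments above can be sketched together. This is a hedged example, not the asker's final code: it reads the sample data from a string rather than the real file, and the names `row_counts` and `programmer_count` are made up for illustration.

```python
import io
import pandas as pd

csv_text = """'Occupation','data'
'Carpenter','data1'
'Carpenter','data2'
'Carpenter','data3'
'Painter','data1'
'Painter','data2'
'Programmer','data1'
'Programmer','data2'
'Programmer','data3'
'Programmer','data4'
"""

# quotechar="'" makes the parser treat single quotes as field delimiters,
# so they are stripped from both the column names and the values.
df = pd.read_csv(io.StringIO(csv_text), quotechar="'")
print(df.columns.tolist())

# size() returns one row count per group, however many other columns exist.
row_counts = df.groupby('Occupation').size()
print(row_counts)

# The result is a Series indexed by group key, so a single count
# can be pulled into a plain variable.
programmer_count = int(row_counts['Programmer'])
print(programmer_count)
```

Here `groupby(...).size()` gives a Series, whereas `groupby(...).agg(['count'])` gives a DataFrame with one count column per remaining data column, which is where the repeated counts come from.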