1

I am using jupyter notebook first time. I tried to groupby one column of csv and get the count of the values. i got the below result with this code.

import pandas
pandas.read_csv('a.csv', sep=',')
df.groupby('name').name.count()
name
>Aa</TOPONYM>                 4
>Aachen</TOPONYM>             5
>Aartselaar</TOPONYM>         1
>Abadan</TOPONYM>             1
>Abaya</TOPONYM>              1
>Abba</TOPONYM>              12
>Abbey                        2
>Abbeydale</TOPONYM>          1
>Abbot</TOPONYM>              2
>Abbots                       3
>Abbotsford</TOPONYM>        22
>Abbotsinch</TOPONYM>         5
>Abbotts                      1
>Abel</TOPONYM>               1
>Aberchirder</TOPONYM>        2
>Aberdare</TOPONYM>           3
>Aberdeen                     1
>Aberdeen</TOPONYM>         163
>Aberdeenshire</TOPONYM>    286
>Aberdour</TOPONYM>           9
>Aberfan</TOPONYM>            1
>Aberfeldy</TOPONYM>         16
>Abergavenny</TOPONYM>        4
>Aberlady                     1
>Aberlady</TOPONYM>           3
>Abernethy</TOPONYM>          1
>Abertay                      1
>Abertillery</TOPONYM>        6
>Abha</TOPONYM>               2
>Abidjan</TOPONYM>           10
                           ... 
>Zakho</TOPONYM>             20
>Zakopane</TOPONYM>           1
>Zambezi                      2
>Zambezi</TOPONYM>            8
>Zambia</TOPONYM>            19
>Zamboanga</TOPONYM>          4
>Zandak</TOPONYM>             3
>Zanzibar</TOPONYM>          11
>Zaragosa</TOPONYM>           1
>Zaragoza</TOPONYM>           4
>Zeebrugge</TOPONYM>         28
>Zeeland</TOPONYM>            2
>Zemun</TOPONYM>              1
>Zenica</TOPONYM>            12
>Zermatt</TOPONYM>            5
>Zetland</TOPONYM>            1
>Zhizhong</TOPONYM>           1
>Zhongshan</TOPONYM>          2
>Zhuhai</TOPONYM>             1
>Zimbabwe</TOPONYM>         377
>Znamenskoye</TOPONYM>        1
>Zoetermeer</TOPONYM>         1
>Zola</TOPONYM>               1
>Zomba</TOPONYM>              3
>Zulu</TOPONYM>               1
>Zululand</TOPONYM>           2
>Zuni</TOPONYM>               2
>Zurich</TOPONYM>            86
>Zvornik</TOPONYM>            3
>Zwolle</TOPONYM>             1
Name: name, Length: 8585, dtype: int64

is it possible to get the counts alphabet by alphabet, first I should run the command with alphabet a and it should give all values with a then next b and so on. or if it's possible to get the values skipping starting 100 values.

My real data looks like this:

<TOPONYM    geonameid="2657540" lat="51.24827"  lon="-0.76389"  >Aldershot</TOPONYM>    
<TOPONYM    geonameid="3037854" lat="49.9"  lon="2.3"   >Amiens</TOPONYM>   
<TOPONYM    geonameid="6216857" lat="-43.59832" lon="171.55011" >Alaska</TOPONYM>   
<TOPONYM    geonameid="3037854" lat="49.9"  lon="2.3"   >Amiens</TOPONYM>   
<TOPONYM    geonameid="2759794" lat="52.37403"  lon="4.88969"   >Amsterdam</TOPONYM>    
<TOPONYM    geonameid="7216668" lat="28.0106"   lon="-82.1184"  >Alabama</TOPONYM>  
<TOPONYM    geonameid="5884078" lat="48.98339"  lon="-73.34907" >Ally</TOPONYM> 
<TOPONYM    geonameid="2507480" lat="36.7525"   lon="3.04197"   >Algiers</TOPONYM>  
<TOPONYM    geonameid="2759794" lat="52.37403"  lon="4.88969"   >Amsterdam</TOPONYM>    
<TOPONYM    geonameid="2759794" lat="52.37403"  lon="4.88969"   >Amsterdam</TOPONYM>    
Moizzy
  • 59
  • 8

1 Answers1

1

You can use select first letter by str[1] and then use value_counts:

df = pandas.read_csv('a.csv')

a = df['name'].str[0].value_counts().rename_axis('alph').reset_index(name='count')

Another solutions with groupby by second letter:

a = df['name'].groupby(df['name'].str[0]).count().reset_index(name='count')

a = df['name'].groupby(df['name'].str[0]).size().reset_index(name='count')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 'SeriesGroupBy' object has no attribute 'counts'. this is the error that I received – Moizzy Nov 09 '17 at 12:11
  • I add solutions instead `df.groupby('name').name.count()`, so try remove it. – jezrael Nov 09 '17 at 12:13
  • I ran this neither it gave an error nor it gave any result. – Moizzy Nov 09 '17 at 12:18
  • oops, typo, `count` not `counts` – jezrael Nov 09 '17 at 12:18
  • I got it based on alphabets , is it possible to get based on values and in alphabetical order – Moizzy Nov 09 '17 at 12:27
  • OK, what is `print (df['name'].head(3).tolist())` ? – jezrael Nov 09 '17 at 12:27
  • or if you can tell me, if I am able to find the values with my own code that i mentioned in my question but using descending order and skipping 100 values of starting – Moizzy Nov 09 '17 at 12:29
  • I have an idea. Is possible add to question `print (df[['name']].head(10)` and then desired output from this first 10 rows? – jezrael Nov 09 '17 at 12:29
  • this gave me [nan, nan, 'Hejaz – Moizzy Nov 09 '17 at 12:31
  • for remove first 100 rows need `df = pandas.read_csv('a.csv', skiprows=range(1,101))` – jezrael Nov 09 '17 at 12:32
  • Hmm, then need `str[0]` instead `str[1]`, because first char is not `>`. – jezrael Nov 09 '17 at 12:33
  • `using descending order` do you need sort `df.groupby('name').name.count()` ? Then need `df.groupby('name').name.count().sort_values(ascending=False)` – jezrael Nov 09 '17 at 12:35
  • I have removed > from the data now, so first letter would be alphabet in this column – Moizzy Nov 09 '17 at 12:35
  • using descending order do you need sort df.groupby('name').name.count() ? Then need df.groupby('name').name.count().sort_values(ascending=False) yes it worked now I am looking to skip starting 50 rows of this result – Moizzy Nov 09 '17 at 12:37
  • Then need `df.groupby('name').name.count().sort_values(ascending=False).iloc[50:]` – jezrael Nov 09 '17 at 12:38
  • Thanks this worked. now there is one more question . – Moizzy Nov 09 '17 at 12:40
  • in this data I have one more column that contains different values for this column with location names. is it possible to sort these location names with that column values and get the counts? – Moizzy Nov 09 '17 at 12:41
  • Not sure if understand, can you add data sample, 5-6 rows? What is `sort these location names with that column values` ? – jezrael Nov 09 '17 at 12:44
  • Aldershot Amiens Alaska Amiens Amsterdam – Moizzy Nov 09 '17 at 12:48
  • Thanks, can you add this data to question, because bad formating in comments with desired output? – jezrael Nov 09 '17 at 12:49
  • It is one column only? And what is desired output? – jezrael Nov 09 '17 at 12:52
  • these are five different columns first with toponym, second with geoid, third with latitude, fourth with longitude and fifth with names. my desired output is to group by the name column and geoid column, because there are names with same geoid as well some different geoids, so I want to sort them out. – Moizzy Nov 09 '17 at 12:57
  • OK, then need `df.groupby(['name','geoid']).name.count().sort_values(ascending=False)‌​.iloc[50:]` – jezrael Nov 09 '17 at 12:59
  • sort_values() got an unexpected keyword argument 'ascening' error – Moizzy Nov 09 '17 at 13:03
  • i made some changes and it worked. is it possible to extract let's suppose only Amsterdam values? – Moizzy Nov 09 '17 at 13:06
  • I think need `b = df.xs('Amsterdam', level=0, axis=0)`, more info [`xs`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.xs.html) – jezrael Nov 09 '17 at 13:07
  • 'RangeIndex' object has no attribute 'get_loc_level' – Moizzy Nov 09 '17 at 13:25
  • If `Amsterdam` is value in column then use `df = df[df['name'] == 'Amsterdam']` – jezrael Nov 09 '17 at 13:26
  • and for check substrings `df = df[df['name'].str.contains( 'Amsterdam')]` – jezrael Nov 09 '17 at 13:27
  • with the first command it gave plain output without any rows. – Moizzy Nov 09 '17 at 13:32
  • It means no match, maybe traling space. So need second one. – jezrael Nov 09 '17 at 13:33
  • Hmmm, if value `Amsterdam` in column `name`, it is weird. If check `print (df['name'].head(20))` there are cities? – jezrael Nov 09 '17 at 13:37
  • Series([], Name: name, dtype: object) this is the output i received.. I think there is some error – Moizzy Nov 09 '17 at 13:39
  • ok it gave the list of starting 20 names now. but with the command we entered string containing Amsterdam. it shows ValueError: cannot index with vector containing NA / NaN values. – Moizzy Nov 09 '17 at 13:41
  • 1
    First means no data and for second need `df = df[df['name'].str.contains( 'Amsterdam', na=False)]` – jezrael Nov 09 '17 at 13:42
  • :) you are welcome! But please dont give me some likes like [this](https://meta.stackexchange.com/questions/126829/what-is-serial-voting-and-how-does-it-affect-me) ;) – jezrael Nov 09 '17 at 13:47
  • can you tell me if it's possible to widen the range of result? – Moizzy Nov 09 '17 at 15:34
  • more columns? Then use [this](https://stackoverflow.com/a/33375383/2901002) – jezrael Nov 09 '17 at 15:36
  • when I tried this import pandas df = pandas.read_csv('a.csv', sep=',') grouped=df.groupby('name').name.count() ax = grouped.sort_values(ascending=False) print (ax) it gave me result only 50 to 60 lines with ... in between – Moizzy Nov 09 '17 at 15:42
  • I am looking for all the names to appear in the result file – Moizzy Nov 09 '17 at 15:43
  • It is only not display, check [solutions](https://stackoverflow.com/q/16424493/2901002) – jezrael Nov 09 '17 at 15:45
  • import pandas df = pandas.read_csv('a.csv', sep=',') grouped=df.groupby('name').name.count() grouped = grouped.set_option('display.max_rows', 500) ax = grouped.sort_values(ascending=False) print (ax) I added this in my script and it gave error > AttributeError: 'Series' object has no attribute 'set_option' – Moizzy Nov 09 '17 at 15:54
  • 1
    typo, need `pandas.set_option` – jezrael Nov 09 '17 at 15:55
  • OK, go home, so no answering for comments ;) Good luck! – jezrael Nov 09 '17 at 15:58