0

I'm trying to count the number of words in a list that start with each letter of the alphabet. I've tried numerous things and nothing seems to work. The end result should be something like this:

list = ['the', 'big', 'bad', 'dog']
a: 0
b: 2
c: 0
d: 1

I assume I should be doing something with dictionaries, right?

awesoon
  • 32,469
  • 11
  • 74
  • 99
Justin
  • 717
  • 1
  • 9
  • 15

4 Answers

6
from collections import Counter
print Counter(s[0] for s in ['the', 'big', 'bad', 'dog'])
# Counter({'b': 2, 't': 1, 'd': 1})

If you want the zeros, you can do this:

import string

di = dict.fromkeys(string.ascii_letters, 0)
for word in ['the', 'big', 'bad', 'dog']:
    di[word[0]] += 1

print di

If you just want 'A' to count the same as 'a':

di = dict.fromkeys(string.ascii_lowercase, 0)
for word in ['the', 'big', 'bad', 'dog']:
    di[word[0].lower()] += 1
# {'a': 0, 'c': 0, 'b': 2, 'e': 0, 'd': 1, 'g': 0, 'f': 0, 'i': 0, 'h': 0, 'k': 0, 'j': 0, 'm': 0, 'l': 0, 'o': 0, 'n': 0, 'q': 0, 'p': 0, 's': 0, 'r': 0, 'u': 0, 't': 1, 'w': 0, 'v': 0, 'y': 0, 'x': 0, 'z': 0}

And you can combine those two:

c = Counter(dict.fromkeys(string.ascii_lowercase, 0))
c.update(s[0].lower() for s in ['the', 'big', 'bad', 'dog'])
print c
# Counter({'b': 2, 'd': 1, 't': 1, 'a': 0, 'c': 0, 'e': 0, 'g': 0, 'f': 0, 'i': 0, 'h': 0, 'k': 0, 'j': 0, 'm': 0, 'l': 0, 'o': 0, 'n': 0, 'q': 0, 'p': 0, 's': 0, 'r': 0, 'u': 0, 'w': 0, 'v': 0, 'y': 0, 'x': 0, 'z': 0})
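To get output in the exact `letter: count` shape the question shows, you can iterate the lowercase alphabet in order. This is a small sketch building on the Counter approach above (Python 3 print syntax here); `Counter` returns 0 for keys it has never seen, which is what makes the zero rows work without pre-seeding:

```python
from collections import Counter
from string import ascii_lowercase

words = ['the', 'big', 'bad', 'dog']
counts = Counter(word[0].lower() for word in words)

# Counter returns 0 for missing keys, so iterating the full
# alphabet prints the zero entries as well.
lines = ['%s: %d' % (letter, counts[letter]) for letter in ascii_lowercase]
print('\n'.join(lines))
```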
dawg
  • 98,345
  • 23
  • 131
  • 206
2
myList = ["the", "big", "bad", "dog"]
from string import ascii_lowercase
d = dict.fromkeys(ascii_lowercase, 0)
for item in myList:
    d[item[0]] += 1
print d

Output

{'a': 0, 'c': 0, 'b': 2, 'e': 0, 'd': 1, 'g': 0, 'f': 0, 'i': 0, 'h': 0, 'k': 0, 'j': 0, 'm': 0, 'l': 0, 'o': 0, 'n': 0, 'q': 0, 'p': 0, 's': 0, 'r': 0, 'u': 0, 't': 1, 'w': 0, 'v': 0, 'y': 0, 'x': 0, 'z': 0}
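One caveat worth noting (my addition, not part of the answer above): `d[item[0]] += 1` raises `KeyError` if a word starts with an uppercase letter or a non-letter, since only lowercase keys were seeded. A defensive sketch:

```python
from string import ascii_lowercase

my_list = ["The", "big", "bad", "dog", "42nd"]
d = dict.fromkeys(ascii_lowercase, 0)
for item in my_list:
    first = item[0].lower()
    if first in d:  # skip words whose first character isn't a-z
        d[first] += 1
```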
thefourtheye
  • 233,700
  • 52
  • 457
  • 497
1

Just as a note, I am showing an example from pandas, a third-party library, to illustrate some of the options in the Python universe beyond the standard `collections` or `itertools` built-ins. Consider this a secondary, supplementary answer, not the primary one.

The website for Pandas is here:

http://pandas.pydata.org/

pandas is easily installed with pip:

$ pip install pandas

The purpose of pandas is fast and syntactically sweet data analysis, like you might expect in R or a spreadsheet program like Microsoft Excel. It's continually being developed by Wes McKinney and a small team of other contributors, and it's released under a BSD license, meaning it's generally free to use in your own projects, commercial or otherwise, so long as you attribute properly.

One advantage pandas has here is that its syntax is very clear (`value_counts`) and its implementation is very fast, much more so than native Python:

from pandas import Series

sample_list = ['the', 'big', 'bad', 'dog']
s = Series([word[0] for word in sample_list])
s.value_counts()

Returns:

b    2
d    1
t    1

Let's take a larger word list:

In [19]: len(big_words)
Out[19]: 229779

One pandas implementation:

def count_first(words):
    s = Series([word[0] for word in words])
    return s.value_counts()

In [15]: %timeit count_first(big_words)
10 loops, best of 3: 29.6 ms per loop

The accepted answer above:

def counter_first(words):
    return Counter(s[0] for s in words)

%timeit counter_first(big_words)
10 loops, best of 3: 105 ms per loop

Significantly faster, even with the list conversion inside the function. Still, we're not being fair to pandas by forcing that conversion. Let's assume we started with a Series instead.

In [20]: s = Series([word[0] for word in big_words])

In [21]: %timeit s.value_counts()
1000 loops, best of 3: 406 µs per loop

That's a 258.6x speed up.
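If you want to reproduce rough timings outside IPython, the standard-library `timeit` module works too. This is only a sketch: the `big_words` corpus above isn't included here, so I substitute a synthetic 200,000-word list, and the absolute numbers will differ by machine:

```python
import timeit
from collections import Counter

# Synthetic stand-in for the big_words corpus used above.
words = ['the', 'big', 'bad', 'dog'] * 50000  # 200,000 words

def counter_first(words):
    return Counter(s[0] for s in words)

# Average of 3 runs, in milliseconds.
elapsed_ms = timeit.timeit(lambda: counter_first(words), number=3) / 3 * 1000
```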

When would I consider using pandas instead of Counter?

A good example would be a spam classifier. If you were approaching a natural language processing problem and needed to analyze word choice by the relative prevalence of words starting with a single letter, and you were looking at thousands of emails and/or websites with millions of words, the speed up from using pandas would be significant.

The bottom line is that pandas is the more performant option, but it requires a little package management (Python- or OS-level) to obtain.

exogeographer
  • 349
  • 1
  • 5
  • The person who asked this question is most probably a rookie trying to understand Python, so pandas is not the way to answer this question – HariHaraSudhan Dec 15 '13 at 15:21
  • Sure, that's the reason there's an accepted answer and there will be higher voted answers. It's entirely possible someone might try this with millions of words and wonder why it's so slow. I don't see what's wrong with providing a differently colored answer in the mix. – exogeographer Dec 15 '13 at 15:26
  • I suggest you edit your answer to explain what the pandas library is and link to its documentation. In that case I will definitely change my opinion. – HariHaraSudhan Dec 15 '13 at 15:33
  • That's fair. Thanks for the productive back and forth! – exogeographer Dec 15 '13 at 15:43
  • +1 `Counter` is the canonical answer to this question. I never thought about using pandas, and I like this example and think it is illustrative for future visitors (like me!). The 3x speed up is a reminder that high level constructs need not trade readability for performance. – Prashant Kumar Dec 17 '13 at 17:17
1
In [63]: %%timeit
....: from collections import defaultdict
....: fq = defaultdict( int )
....: for word in words:
....:        fq[word[0].lower()] += 1
....:
10 loops, best of 3: 138 ms per loop


In [64]: %%timeit
....: from collections import Counter
....: r = Counter(word[0].lower() for word in words)
....:
1 loops, best of 3: 287 ms per loop

In [65]: len(words)
Out[65]: 235886

The word list comes from /usr/share/dict/words; the timings above use IPython's %%timeit magic.

In [68]: fq
Out[68]: defaultdict(<type 'int'>, {'a': 17096, 'c': 19901, 'b': 11070, 'e': 8736, 'd': 10896, 'g': 6861, 'f': 6860, 'i': 8799, 'h': 9027, 'k': 2281, 'j': 1642, 'm': 12616, 'l': 6284, 'o': 7849, 'n': 6780, 'q': 1152, 'p': 24461, 's': 25162, 'r': 9671, 'u': 16387, 't': 12966, 'w': 3944, 'v': 3440, 'y': 671, 'x': 385, 'z': 949})
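Pulled out of the IPython session into a plain script (my own condensation, using the question's small list rather than /usr/share/dict/words), the defaultdict approach is just:

```python
from collections import defaultdict

words = ['the', 'big', 'bad', 'dog']
fq = defaultdict(int)  # missing keys default to 0
for word in words:
    fq[word[0].lower()] += 1
```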

I would suggest using defaultdict, since it is a straightforward approach and faster.

In [69]: %%timeit
....: d = {}
....: for word in words:
....:        key = word[0].lower()
....:        if key in d:
....:                d[key] += 1
....:        else:
....:                d[key] = 1
....:
1 loops, best of 3: 177 ms per loop

The plain-dict approach also turns out faster than Counter here, at the cost of a few extra lines of code.

Kracekumar
  • 19,457
  • 10
  • 47
  • 56
  • Ha ha ha!!! This would [usually be me](http://stackoverflow.com/a/20308657/298607) posting timings!!! :-) One thing of note: Counter recently received an overhaul. In Python 3.3.3, on the linked tests, `f5` -- the Counter went from being the slowest to the fastest by a lot. I depends on the amortization of setup time, the size of the data, number of keys, number of missing keys and Python version. – dawg Dec 15 '13 at 18:28