How to count word frequencies within a file in Python

Question

I have a .txt file with the following format,

C
V
EH
A
IRQ
C
C
H
IRG
V

Although it's obviously a lot bigger than that, this is essentially it. Basically, I'm trying to count how many times each individual string appears in the file (each letter/string is on a separate line, so technically the file is C\nV\nEH\n etc.). However, when I try to convert these files into a list and then use the count function on it, it separates out the letters, so that strings such as 'IRQ' become ['\n','I','R','Q','\n']. When I then count, I get the frequencies of each individual letter and not of the strings.

Here is the code that I have written so far,

def countf():
    fh = open("C:/x.txt","r")
    fh2 = open("C:/y.txt","w")
    s = []
    for line in fh:
        s += line
    for x in s:
        fh2.write("{:<s} - {:<d}".format(x,s.count(x)))

What I want to end up with is an output file that looks something like this

C  10
V  32
EH 7
A  1
IRQ  9
H 8

Does it have to be done in python? `sort yourfile.txt | uniq -c` will give you word counts (you mention C:\ so you seem to be on windows, `sort` and `uniq` are standard unix commands that you can get if you install cygwin or http://unxutils.sourceforge.net/). — John Carter, Aug 24 '12 at 22:49
@therefromhere - I think the OP wants word counts. The python code is generating letter counts the way that it is written. `sort` and `uniq` will technically generate line counts. Not sure if this is correct or not. — D.Shawley, Aug 24 '12 at 22:54
Word counts, just some of those words happen to be composed of a single letter, it's for biological research. As for doing it in python, that and R are the only languages I'm familiar with and tbh I'd like to figure this out within python — TheFoxx, Aug 24 '12 at 22:54
@D.Shawley yeah sorry I misread - only had one coffee >< deleted my comment. — John Carter, Aug 24 '12 at 22:55
@therefromhere - 'word' needn't be 'english language word'. String would have been better for the OP to use, though. — selllikesybok, Aug 24 '12 at 22:55

Ashwini Chaudhary · Accepted Answer · 2012-08-24T22:58:13.020

6

Use collections.Counter, and use strip() to remove the trailing \n from each line:

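from collections import Counter

with open('x.txt') as f1, open('y.txt', 'w') as f2:
    # strip() drops the trailing newline, so each line is counted as one string
    c = Counter(x.strip() for x in f1)
    for x in c:
        print("{0} {1}".format(x, c[x]))        # show the count on screen
        f2.write("{0} {1}\n".format(x, c[x]))   # and write it to y.txt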

Output for the sample above (the order of the lines is not guaranteed and depends on the Python version):

A 1
C 3
EH 1
IRQ 1
V 2
H 1
IRG 1

edited Aug 24 '12 at 22:58

answered Aug 24 '12 at 22:51

Ashwini Chaudhary


score 0 · Answer 2 · answered Aug 24 '12 at 22:49

Change s += line to s.extend(line.split()). The += operator concatenates two sequences, and a string is treated as a sequence of characters, so each character ends up as its own entry in the list. You can either use list.append (e.g., s.append(line)) to add the entire line as a single entry in the list, or use list.extend to add a list of strings.
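A minimal sketch of the difference between the three operations, assuming a hypothetical line "IRQ\n" as it would be read from the file (the variable names are made up for illustration):

```python
line = "IRQ\n"              # hypothetical line, as read from the file

chars = []
chars += line               # += walks the string character by character
print(chars)                # ['I', 'R', 'Q', '\n']

whole = []
whole.append(line)          # one entry, trailing newline included
print(whole)                # ['IRQ\n']

words = []
words.extend(line.split())  # split() drops the newline and returns a list
print(words)                # ['IRQ']
```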

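In this case, I used line.split() to split the line into individual words and then added the list of words to the current list. If each line only contains a single word, then you can use s.append(line.strip()) instead; the strip() call removes the trailing newline, which would otherwise be stored as part of the string.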
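Applied to the function from the question, the change might look roughly like the sketch below. The in_path and out_path parameters are hypothetical stand-ins for the hard-coded paths, and the extra loop that reports each string only once is my addition rather than part of the original code.

```python
def countf(in_path, out_path):
    # in_path / out_path are hypothetical parameters replacing the hard-coded paths
    s = []
    with open(in_path) as fh:
        for line in fh:
            s.extend(line.split())   # whole words, newline removed

    # Collect each distinct string once, in first-seen order,
    # so the output has one line per string instead of one per occurrence.
    seen = []
    for x in s:
        if x not in seen:
            seen.append(x)

    with open(out_path, "w") as fh2:
        for x in seen:
            fh2.write("{:<s} - {:<d}\n".format(x, s.count(x)))
```

Note that s.count(x) rescans the whole list for every distinct string, so for a large file a Counter or a dictionary of counts is likely to be noticeably faster.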

score 0 · Answer 3 · edited Aug 24 '12 at 23:08

Ashwini's answer is good if you have Python 2.7 or 3.1 (or later), but 2.6 and 3.0 don't have collections.Counter.

For portability to these older versions, you may be better off using collections.defaultdict(int).
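A minimal sketch of the defaultdict(int) approach, assuming one string per line as in the question. The count_lines helper and the in-memory sample list are hypothetical; an open file object can be passed in place of the list.

```python
from collections import defaultdict


def count_lines(lines):
    # count_lines is a hypothetical helper name, not from the question
    counts = defaultdict(int)      # missing keys start at 0
    for line in lines:
        word = line.strip()        # drop the trailing newline
        if word:                   # skip blank lines
            counts[word] += 1
    return counts


# Sample data from the question, as the lines would come out of the file
sample = ["C\n", "V\n", "EH\n", "A\n", "IRQ\n", "C\n", "C\n", "H\n", "IRG\n", "V\n"]
counts = count_lines(sample)
for word in sorted(counts):
    print("{0} {1}".format(word, counts[word]))
```

The explicit {0} and {1} indices are used because Python 2.6 does not accept empty {} fields in str.format.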

edited Aug 24 '12 at 23:08

Ashwini Chaudhary


answered Aug 24 '12 at 23:02

dstromberg


There is [a backport](http://code.activestate.com/recipes/576611/) of `collections.Counter` that is supposed to work with 2.5 and 2.6. Also, this answer should probably be a comment. – Karl Knechtel Aug 25 '12 at 00:25
