Get word count from a file ignoring comment lines in python

Question

I am trying to count the number of occurrences of a word in a file using Python, but I have to ignore comments in the file.

I have a function like this:

def getWordCount(file_name, word):
  count = file_name.read().count(word)
  file_name.seek(0)
  return count

How can I ignore lines that begin with a `#`?

I know this can be done by reading the file line by line, as described in this question. Is there a faster, more Pythonic way to do it?

Is it possible that a line contains content followed by comment? Like `foo # comment`? — Willem Van Onsem, Mar 06 '17 at 13:30
`file_name.read()` is not very Pythonic. `file_name` suggests this is a string with the file name but `.read()` suggests this is a file object. As for your question: have you considered reading the file [line by line](https://docs.python.org/3.3/tutorial/inputoutput.html#methods-of-file-objects)? — MB-F, Mar 06 '17 at 13:32
@kazemakase I am passing the file object, but cannot name it as file. hence I named it as `file_name` — ThisaruG, Mar 06 '17 at 13:34
Well you cannot count faster than looking at every word. Whether you do this line by line, or in bulk has some impact on performance, but in terms of big oh, all methods are at least *O(n)*... — Willem Van Onsem, Mar 06 '17 at 13:34
@kazemakase No I didn't. I just wanted to know whether there is a better way. — ThisaruG, Mar 06 '17 at 13:34
@WillemVanOnsem Oh, yes. Thanks for the help. I'll use line by line method then. — ThisaruG, Mar 06 '17 at 13:35
@ThisaruGuruge: if you however will query multiple words, you can use a `Counter` that simply stores the count of every word. In that case you do the counting step only once. The retrieve step can then be done in *O(1)*... — Willem Van Onsem, Mar 06 '17 at 13:36
@ThisaruGuruge oh, I missed your last sentence in the question, sorry :) You could probably use a regular expression to filter out comments but I'm not sure if that's worth the effort... — MB-F, Mar 06 '17 at 13:37
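
The `Counter` suggestion from the comments could look roughly like the sketch below. It is only an illustration: the function name `count_words` and the sample text are made up, and it assumes that "word" means a run of letters, digits or underscores (`\w+`). Note that this counts whole tokens, whereas `str.count` in the question counts substrings, so the two can give different numbers.

```python
import io
import re
from collections import Counter

def count_words(file_obj):
    """Count every word on lines that are not whole-line comments."""
    counts = Counter()
    for line in file_obj:
        if line.lstrip().startswith('#'):
            continue  # skip lines whose first non-blank character is '#'
        counts.update(re.findall(r'\w+', line))
    file_obj.seek(0)  # rewind, as the function in the question does
    return counts

# Hypothetical sample data standing in for a real file object.
sample = io.StringIO("word and word\n# word in a comment\nanother word\n")
counts = count_words(sample)
print(counts['word'])     # 3
print(counts['missing'])  # 0
```

The file is scanned once; after that, each lookup in `counts` is a dictionary access.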

score 1 · Answer 1 · answered Mar 06 '17 at 13:38

One option is to first create a copy of the file without the comment lines, then run your code on that copy. For example:

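# Copy every line that is not a comment into a new file.
with open('./file_with_comment.txt') as infile, \
        open('./newfile.txt', 'w') as newopen:
    for line in infile:
        # strip() also catches comment lines with leading whitespace
        if not line.strip().startswith("#"):
            newopen.write(line)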

This drops every line that starts with # (ignoring leading whitespace). Then run your function on a file object opened from newfile.txt:

def getWordCount(file_name, word):
  count = file_name.read().count(word)
  file_name.seek(0)
  return count

score 1 · Answer 2 · answered Mar 06 '17 at 13:46

A more Pythonic version would be this (note that here file_name is a path, not an open file object):

def getWordCount(file_name, word):
  with open(file_name) as wordFile:
    return sum(line.count(word)
      for line in wordFile
      if not line.startswith('#'))

A faster approach (speed being independent of how Pythonic the code is) could be to read the whole file into one string and then use a regular expression to find the words that are not on a line starting with a hash.
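
A minimal sketch of that regular-expression idea, assuming the whole file fits in memory. The names `COMMENT_LINE` and `count_outside_comment_lines` are made up for illustration, and whether this actually beats the line-by-line version has not been measured here, so it would need to be timed on real data.

```python
import re

# With re.MULTILINE, ^ and $ match at the start and end of every line,
# so this matches each whole line whose first character is '#'.
COMMENT_LINE = re.compile(r'^#.*$', re.MULTILINE)

def count_outside_comment_lines(text, word):
    """Blank out comment lines, then count substring occurrences."""
    return COMMENT_LINE.sub('', text).count(word)

# Hypothetical sample; in practice text would come from wordFile.read().
text = "word here\n# word in comment\nword # word\n"
print(count_outside_comment_lines(text, 'word'))  # 3
```

Only lines that begin with a hash are removed, so the trailing `# word` on the last line still counts.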

Alfe


Since Python comments allow whitespace before the '#' you should probably do `line.strip().startswith('#')`. – Hannes Ovrén Mar 06 '17 at 13:58
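
Folding that suggestion into the function from this answer might look like the sketch below. The snake_case name `get_word_count` and the temporary file are only there to make the example self-contained; `lstrip()` is used instead of `strip()` because only leading whitespace matters for the check.

```python
import os
import tempfile

def get_word_count(path, word):
    """Count substring occurrences of word, skipping comment lines."""
    with open(path) as word_file:
        return sum(line.count(word)
                   for line in word_file
                   if not line.lstrip().startswith('#'))

# Write a small throwaway file to try the function on.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("word\n   # indented comment word\n#word\nword word\n")
    path = tmp.name

result = get_word_count(path, 'word')
os.remove(path)
print(result)  # 3
```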

score 1 · Accepted Answer · answered Mar 06 '17 at 13:47

You can use a regular expression to filter out comments:

import re

text = """ This line contains a word. # empty
This line contains two: word word  # word
newline
# another word
"""

filtered = ''.join(re.split('#.*', text))
print(filtered)
#  This line contains a word. 
# This line contains two: word word  
# newline

print(text.count('word'))  # 5
print(filtered.count('word'))  # 3

Just replace `text` with the result of `file_name.read()`.

MB-F


OP stated that comments are lines beginning with a hash. You are also filtering out comments starting in the middle of a line (which is of course more typical for real-world examples of comments). – Alfe Mar 06 '17 at 13:50
@Alfe right. OP clarified in the comments that content lines can also be followed by a comment. – MB-F Mar 06 '17 at 13:53
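
For comparison, here is a sketch that handles both whole-line and trailing comments without a regular expression, keeping the question's convention of passing an open file object. It cuts each line at its first `#`, which is an assumption about the file format: a `#` inside a quoted string would also be treated as the start of a comment. The name `get_word_count` and the `io.StringIO` sample are illustrative only.

```python
import io

def get_word_count(file_obj, word):
    """Count word in the part of each line before the first '#'."""
    count = sum(line.split('#', 1)[0].count(word) for line in file_obj)
    file_obj.seek(0)  # rewind so the file can be read again
    return count

# Same sample text as in the answer, wrapped in a file-like object.
sample = io.StringIO(
    " This line contains a word. # empty\n"
    "This line contains two: word word  # word\n"
    "newline\n"
    "# another word\n"
)
print(get_word_count(sample, 'word'))  # 3
```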
