How can I count the occurrences of a word in Python?

Question

I am trying to write a Python script that looks through a log file and tells us how many times the user bin appears. So far I have this:

import re

# Open auth.log for reading
myAuthlog = open('auth.log', 'r')
for line in myAuthlog:
    if re.match("(.*)(B|b)in(.*)", line):
        print line

This prints out the full lines, e.g.:

>>> Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2

But I only want to output the number of matching lines, e.g. "the user attempted to log in 26 times".

You don't need the surrounding `.*` unless you explicitly are trying to extract the surrounding text — OneCricketeer, Feb 18 '16 at 19:11
`count = sum(1 for line in myAuthlog if re.match("(.*)(B|b)in(.*)", line))` — Peter Wood, Feb 18 '16 at 19:12

score 1 · Accepted Answer · answered Feb 18 '16 at 19:14

import re

count = 0
# Open auth.log and count the lines that match the pattern
with open('auth.log', 'r') as myAuthlog:
    for line in myAuthlog:
        if re.match("(.*)(B|b)in(.*)", line):
            count += 1
print(count)

Harkamal Jot Kumar


timgeb · Answer 2 · 2016-02-18T19:26:34.470

Option 1:

If your file is not gigantic, you can use re.findall and get the length of the resulting list:

count = len(re.findall(your_regex, myAuthlog.read()))

Option 2:

If your file is very large, iterate over the lines in a generator expression and sum up the matches:

count = sum(1 for line in myAuthlog if re.search(your_regex, line))

Both options assume that you want to count the number of lines for which you get a match, as your sample code indicates. Option 1 also assumes that the username appears at most once per line, because re.findall counts every occurrence rather than every matching line.
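To make the difference between the two options concrete, here is a minimal sketch. It assumes Python 3 syntax, a word-boundary pattern chosen for illustration, and invented sample lines rather than real auth.log content:

```python
import re

# Invented sample lines for illustration; the second one mentions bin twice.
sample_lines = [
    "login failed for bin\n",
    "bin retried, login failed for bin\n",
    "login failed for root\n",
]
pattern = r"\b[Bb]in\b"

# Option 1: count every occurrence in the whole text.
occurrences = len(re.findall(pattern, "".join(sample_lines)))

# Option 2: count each line at most once, however often it matches.
matching_lines = sum(1 for line in sample_lines if re.search(pattern, line))

print(occurrences, matching_lines)
```

The two counts only agree when no line contains the name more than once.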

A note about your regex:

(.*)(B|b)in(.*) will also match strings like 'Carabinero'. Consider using word boundaries, i.e. \b(B|b)in\b.
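A quick way to check this, sketched in Python 3 with a few test strings picked for illustration (the names loose and strict are just labels for this example):

```python
import re

loose = re.compile(r"(.*)(B|b)in(.*)")   # pattern from the question
strict = re.compile(r"\b(B|b)in\b")      # same pattern with word boundaries

samples = ["Carabinero", "binary", "Failed password for bin from host"]

# Map each sample to (matched by loose pattern, matched by strict pattern).
results = {
    text: (bool(loose.match(text)), bool(strict.search(text)))
    for text in samples
}

for text, (loose_hit, strict_hit) in results.items():
    print(repr(text), "loose:", loose_hit, "strict:", strict_hit)
```

Only the last string, where bin stands alone as a word, is accepted by both patterns.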

ShadowRanger · Answer 3 · 2016-02-18T20:24:46.903

In addition to @cricket_007's comment (no need for .*, as long as you switch to re.search, which doesn't implicitly anchor the pattern at the start of the string), searching for bin with no other qualifiers is likely to produce a lot of false positives. Using grouping parentheses also makes the check more expensive, because the engine has to store capture groups. Lastly, you should always use raw strings for regexes, or it will eventually bite you. Put together, you could use if re.search(r'\b[Bb]in\b', line): to enforce word boundaries, avoid unnecessary capture, and still do what you intend.
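The points about re.match versus re.search and about raw strings can be checked with a short sketch. It assumes Python 3 and an invented sample line:

```python
import re

line = "Failed password for bin from host"  # invented sample line

# re.match only tries the pattern at position 0, so a bare pattern fails here,
# while re.search scans the whole string.
match_result = re.match(r"[Bb]in", line)
search_result = re.search(r"[Bb]in", line)

# Without the r prefix, "\b" is a single backspace character, not a word
# boundary, so the pattern silently looks for backspaces around "bin".
non_raw_result = re.search("\b[Bb]in\b", line)
raw_result = re.search(r"\b[Bb]in\b", line)

print(match_result, search_result)
print(non_raw_result, raw_result)
print(len("\b"), len(r"\b"))
```

The non-raw pattern compiles without any error, which is what makes the mistake easy to miss.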

You could even optimize it a bit by pre-compiling the regex (Python caches compiled regular expressions, but it still involves executing Python level code to check the cache every time; a compiled object goes straight to C with no delays).

This lets you simplify to:

import re

# Compile and store bound method with useful name; use a character class
# to avoid capturing B/b, and word boundary assertions to avoid matching
# longer words containing the name, e.g. "binary" when you want bin
hasuser = re.compile(r'\b[Bb]in\b').search

# Open auth.log for reading, using a with statement to close the file deterministically
with open('auth.log') as myAuthlog:
    # Filter for lines with the specified user (in Py3, you would need to wrap
    # filter in list or use the sum(1 for _ in filter(hasuser, myAuthlog)) idiom)
    loginattempts = len(filter(hasuser, myAuthlog))
print "User attempted to log in", loginattempts, "times"
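For Python 3, where filter returns a lazy iterator and print is a function, a sketch of the same approach could look like this. The helper name count_login_attempts and the in-memory sample log are assumptions made for illustration so the example runs without a real auth.log:

```python
import io
import re

# Bound search method of a precompiled pattern, as in the code above.
hasuser = re.compile(r'\b[Bb]in\b').search

def count_login_attempts(lines):
    """Count the lines that mention the user; accepts any iterable of strings."""
    # filter() is lazy in Python 3, so sum over it instead of calling len().
    return sum(1 for _ in filter(hasuser, lines))

# Invented in-memory stand-in for auth.log (hypothetical content).
fake_log = io.StringIO(
    "Failed password for bin from host\n"
    "Accepted password for binary-user\n"
    "Failed password for Bin from host\n"
)
loginattempts = count_login_attempts(fake_log)
print("User attempted to log in", loginattempts, "times")
```

With a real file, the same helper can be called as count_login_attempts(myAuthlog) inside a with open('auth.log') as myAuthlog: block, since an open text file is also an iterable of lines.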
