Extract a list of unique visitors from a log file

Question

I would like to extract from a list of log files (named access.log.*) that look like this

95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:21 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"

a list of unique visitors (at most one entry per visitor per day) who requested /files/myfile.rar, i.e.:

95.11.113.x - [15/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"

I tried opening the files and looking for the desired string /files/myfile.rar as described in "Search for string in txt file Python", but I did not manage to test for "same IP address" and repetitions.

What should I use to do this? A standard string search, one line after another (as in "Search for string in txt file Python")? A regexp?

PS: even better for future use (sorting by date, etc.):

2013-11-15 - 95.11.113.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-16 - 132.41.100.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-17 ....

use regex to extract the ips into a `set` and you're good. I've done it with groovy/grails but it still may help: https://bitbucket.org/alfasin/log-analyzer and [*this page*](https://bitbucket.org/alfasin/log-analyzer/src/2805691175c1fcc1fda4cc7cf0b151cfe0029f93/grails-app/domain/netflix/Log.groovy) in specific — Nir Alfasi, Jan 07 '14 at 16:45
yes but I will only have the ips in a `set` and I won't have the dates anymore ? @alfasin — Basj, Jan 07 '14 at 16:46
By the way, is there a way in order to ask Apache to give FULL IP in the logs @alfasin, and not incomplete IP like `123.123.123.x` ? — Basj, Jan 07 '14 at 20:38

Sabuj Hassan · Accepted Answer · 2014-01-07T16:55:24.677

Here is an algorithm for your Python code:

1) Read each line from the file.
2) Check whether the line contains the text /files/myfile.rar. If it does, continue with steps 3 and 4.
3) Parse the IP address from the line. You can use a regex, or split the line on " - " and take the first part.
4) Save the line in a dict keyed by the IP address, like this: visitors[ip] = line

When you are done, print the values of visitors as the output.

Here is sample code for 3) and 4), as you requested.

visitors = dict()
# the same steps apply to every line read from the file
line = '95.11.113.x - [15/Nov/2013]'
ip = line.split(" - ")[0]  # assumes the line contains " - "
visitors[ip] = line  # a later line from the same IP overwrites the earlier one

# finally, when all lines have been processed
for visitor in visitors:
    print(visitors[visitor])
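
Keying the dict on the IP alone keeps one line per address across all days. To get one line per visitor per day, as the question asks, the key can be the (ip, day) pair instead. The following is only a sketch: the function name unique_daily_visitors and the shortened sample lines are made up for illustration, and it assumes every matching line starts with `IP - [dd/Mon/yyyy:...`.

```python
def unique_daily_visitors(lines, path="/files/myfile.rar"):
    """Return the first matching line for each (ip, day) pair, in input order."""
    visitors = {}
    for line in lines:
        if path not in line:
            continue
        # Assumption: the line looks like 'IP - [15/Nov/2013:18:25:17 +0100] ...'
        ip, rest = line.split(" - ", 1)
        day = rest[1:rest.index(":")]  # text between '[' and the first ':'
        # setdefault keeps the first line seen for this (ip, day) pair
        visitors.setdefault((ip, day), line.rstrip("\n"))
    return visitors


# Hypothetical, shortened sample lines in the format shown in the question
sample = [
    '95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200',
    '132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
]

result = unique_daily_visitors(sample)
for (ip, day), line in result.items():
    print(day, ip)
```

The same function accepts an open file object in place of the list, since it only iterates over lines.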

edited Jan 07 '14 at 16:55

answered Jan 07 '14 at 16:47


Thank you for your answer @SabujHassan. Can you give more details for 3) and 4), I'm new to such tools ! (PS : I didn't downvote!) – Basj Jan 07 '14 at 16:48
Read the file line by line using a loop. Then perform 2,3,4 on each line. – Sabuj Hassan Jan 07 '14 at 17:11

Burhan Khalid · Answer 2 · 2014-01-08T10:23:18.190

Here is how to get your answer grouped by date, that is, the unique visitors per day who requested myfile.rar, across all files named access.log.*:

import glob
from collections import defaultdict

# maps each date string, e.g. '15/Nov/2013', to the set of IPs seen that day
d = defaultdict(set)

for file in glob.glob('access.log.*'):
    with open(file) as log:
        for line in log:
            if line.strip():  # skips empty lines
                bits = line.split('-')
                ip = bits[0].strip()
                fields = bits[1].split()
                date = fields[0][1:-9]  # '[15/Nov/2013:18:25:17' -> '15/Nov/2013'
                url = fields[3]
                if url == '/files/myfile.rar':
                    d[date].add(ip)

for date, values in d.items():
    print('Total unique visits for {}: {}'.format(date, len(values)))
    for ip in values:
        print(ip)
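
For the date-sortable output requested in the question's PS (2013-11-15 - ip - "GET ..."), one possible approach is to parse each line with a regex and convert the date with datetime.strptime. This is a sketch under assumptions: the pattern LOG_RE and the function daily_report are illustrative names, the pattern is derived only from the sample lines in the question, and %b relies on English month abbreviations (the default locale behaviour).

```python
import re
from datetime import datetime

# Assumed layout, taken from the question's sample lines:
# IP - [dd/Mon/yyyy:HH:MM:SS +zzzz] "METHOD /path PROTOCOL" ...
LOG_RE = re.compile(r'^(\S+) - \[([^:\]]+)[^\]]*\] ("\S+ (\S+) [^"]*")')


def daily_report(lines, path="/files/myfile.rar"):
    """Return sorted 'YYYY-MM-DD - ip - "request"' strings, one per (day, ip)."""
    seen = {}
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or m.group(4) != path:
            continue  # unparsable line or a different URL
        ip, day, request = m.group(1), m.group(2), m.group(3)
        iso_day = datetime.strptime(day, "%d/%b/%Y").date().isoformat()
        seen.setdefault((iso_day, ip), request)
    # ISO dates sort chronologically when compared as plain strings
    return ['{} - {} - {}'.format(day, ip, req)
            for (day, ip), req in sorted(seen.items())]


# Hypothetical, shortened sample lines in the format shown in the question
sample = [
    '132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200',
]

report = daily_report(sample)
for entry in report:
    print(entry)
```

Converting to ISO dates before sorting avoids the problem that strings like 15/Nov/2013 do not sort chronologically across months.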

Thank you! There must be a small mistake because `log` has no `.split` ? We need to add a loop on each line ? — Basj, Jan 07 '14 at 19:53

score 0 · Answer 3 · answered Jan 07 '14 at 17:23

The code below applies the method from Sabuj Hassan's answer. I am posting it only for future reference.

visitors = dict()

with open('access.log.52') as fp:
    for line in fp:
        if '/files/myfile.rar' in line:
            ip = line.split(" - ")[0]  # assuming it must have " - " in line
            visitors[ip] = line

for ip in visitors:
    print(visitors[ip].rstrip())  # drop the trailing newline read from the file
