I am looking to parse Microsoft DNS debugging log responses. The idea is to parse out the domains and print a count of how many times each domain occurs in the debug log. Typically I would first run something like `grep -v " R " log > tmp` to filter the log down to a temporary file, then manually grep for each domain with `grep domain tmp`. I assume there is a better way.

20140416 01:38:52 588 PACKET  02030850 UDP Rcv 192.168.0.10 2659 R Q [8281   DR SERVFAIL] A     (11)quad(3)sub(7)domain(3)com(0)
20140416 01:38:52 588 PACKET  02396370 UDP Rcv 192.168.0.5 b297 R Q [8281   DR SERVFAIL] A     (3)pk(3)sub(7)domain(3)com(0)
20140415 19:46:24 544 PACKET  0261F580 UDP Snd 192.168.0.2  795a   Q [0000       NOERROR] A     (11)tertiary(7)domain(3)com(0)
20140415 19:46:24 544 PACKET  01A47E60 UDP Snd 192.168.0.1 f4e2   Q [0001   D   NOERROR] A     (11)quad(3)sub(7)domain(3)net(0)

For the above data, something like the following output would be great:

domain.com 3
domain.net 1

This would indicate that the script or command found three query entries for domain.com and one for domain.net. I am not concerned about tertiary or greater hosts being included in the calculation. A shell command or Python would be fine. Here's some pseudocode to hopefully drive the question home.

theFile = open('log', 'r')
FILE = theFile.readlines()
theFile.close()

printList = []
# search for unique queries and count them
for line in FILE:
    if 'query for the " Q " field' in line:
        # store until the count for this unique value is complete
        printList.append(line)

for item in printList:
    print item    # print the summary: a count for each unique domain
Astron

3 Answers


Perhaps something like this? I'm no expert at regular expressions, but this should get the job done as I understand the format you're parsing.

#!/usr/bin/env python

import re

ret = {}

with open('log', 'r') as theFile:
    for line in theFile:
        # pull out the last two name labels after the "Q [...]" section,
        # e.g. "domain" and "com"
        match = re.search(r'Q \[.+\].+\(\d+\)([^\(]+)\(\d+\)([^\(]+)', line.strip())
        if match is not None:
            key = ' '.join(match.groups())
            if key not in ret:
                ret[key] = 1
            else:
                ret[key] += 1

for k in ret:
    print '%s %d' % (k, ret[k])
Mostly Harmless
  • This seems to do the trick. How might you sort based on `key`? – Astron Apr 21 '14 at 15:32
  • `ret` is just a dictionary, so if you do something like `x = ret.keys()` followed by `x.sort()` (or any other method you would use on a list of domains to sort them), you should then be able to iterate over it with `for k in x:` – Mostly Harmless Apr 21 '14 at 15:41
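To put that comment into code, here is a minimal sketch of the sorting approach (it assumes the `ret` dictionary built in the answer above):

for k in sorted(ret):    # sorted() walks the keys in alphabetical order
    print('%s %d' % (k, ret[k]))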

How about this? It's a bit of a brute-force approach:

>>> import re
>>> from collections import Counter
>>> with open('t.txt') as f:
...     c = Counter('.'.join(re.findall(r'(\w+\(\d+\))',line.split()[-1])[-2:]) for line in f)
... 
>>> for domain, count in c.most_common():
...    print domain,count
... 
domain(3).com(0) 3
domain(3).net(0) 1
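If you want the `(n)` length markers stripped so the output matches the format in the question, removing them from each key should work; an untested tweak, continuing the session above, which should give `domain.com 3` and `domain.net 1` for the sample data:

>>> for domain, count in c.most_common():
...     print('%s %d' % (re.sub(r'\(\d+\)', '', domain), count))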
Burhan Khalid

It doesn't quite meet the output you asked for, but would this work for you?

# keep the last whitespace-separated field (the encoded domain name)
# from every PACKET line
dns = [line.strip().split()[-1] for line in open(r"path\to\file") if "PACKET" in line]

domains = {}
for d in dns:
    if d not in domains:
        domains[d] = 1
    else:
        domains[d] += 1

for k, v in domains.iteritems():
    print "%s %d" % (k, v)
Eugene C.
  • Very close but I just need domain and TLD combo, not the tertiary and/or quad. So just `(7)domain(3)com(0) 3` in this case. – Astron Apr 21 '14 at 15:02
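For reference, one way to get the output requested in the comment is to trim each entry down to its last two labels (plus the terminating `(0)`) before counting. This is an untested sketch that reuses the `dns` list from the answer above; the `tail` regex is an assumption about the label format:

import re

# keep only the trailing "(len)domain(len)tld(0)" portion of each entry,
# e.g. "(11)quad(3)sub(7)domain(3)com(0)" -> "(7)domain(3)com(0)"
tail = re.compile(r'\(\d+\)[^()]+\(\d+\)[^()]+\(0\)$')

domains = {}
for d in dns:
    m = tail.search(d)
    if m is None:       # skip entries with fewer than two labels
        continue
    key = m.group(0)
    domains[key] = domains.get(key, 0) + 1

for k, v in domains.items():
    print('%s %d' % (k, v))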