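I have a server log file. My regular expression extracts every run of four dot-separated numbers of 1 to 3 digits each, including the version strings in the user agent, but I only want the IP address at the start of each line. My code, output, and desired output are below.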

192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 "http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 (KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"
10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 "http://www.cole-brown.net/category/main/list/privacy/" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5322 (KHTML, like Gecko) Chrome/14.0.843.0 Safari/5322"
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"

My Code

import re
with open (r'C:\Users\ubuntu\Desktop\Tests\apache.log', 'r') as fr1:
    line1 = fr1.read()
regex = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
#print(re.findall(regex, line1, re.DOTALL))
listofip = (re.findall(regex, line1))
result ={}
for i in listofip:
    result[i] = listofip.count(i)
result

My Output

{'192.168.10.20': 1,
 '14.0.870.0': 1,
 '10.30.24.3': 1,
 '14.0.843.0': 1,
 '98.5.45.3': 1,
 '1.9.6.20': 1}

Desired Output

{'192.168.10.20': 1,
 '10.30.24.3': 1,
 '98.5.45.3': 1}
  • Maybe you need `r'(?m)^\d{1,3}(?:\.\d{1,3}){3}\b'`, to only get the IP at the start of the line? See [regex demo](https://regex101.com/r/fh9Crf/1). – Wiktor Stribiżew Aug 16 '19 at 06:49
  • You can use `split('--')` to get the values at the beginning of the line if the format is always the same. – Shijith Aug 16 '19 at 06:54
  • Or just iterate over the lines and split each and get the first item. Unless you may have lines that do not start with an IP. – Wiktor Stribiżew Aug 16 '19 at 06:55

2 Answers

If every line starts with an IP, you can simply read the file line by line, split each line on whitespace, and take the first item:

#line1=r'''192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 "http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 (KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"
#10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 "http://www.cole-brown.net/category/main/list/privacy/" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5322 (KHTML, like Gecko) Chrome/14.0.843.0 Safari/5322"
#98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"
#98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"'''
result ={}
with open (r'C:\Users\ubuntu\Desktop\Tests\apache.log', 'r') as fr1:
    for line in fr1:
        ip = line.split()[0]
        if ip in result:
            result[ip] += 1
        else:
            result[ip] = 1
print(result)
# => {'192.168.10.20': 1, '10.30.24.3': 1, '98.5.45.3': 2}

See the Python demo.
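The same line-by-line counting can be sketched a little more compactly with `collections.Counter`. This is only an illustration, not part of the original demo: `SAMPLE_LOG` and `count_first_fields` are hypothetical names, the sample lines are shortened, and an in-memory `io.StringIO` stands in for the real file. It also skips blank lines, where `line.split()[0]` would raise `IndexError`.

```python
import io
from collections import Counter

# Hypothetical, shortened stand-in for the log file; only the first field matters.
SAMPLE_LOG = """\
192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042
10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939

98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958
"""


def count_first_fields(lines):
    """Count the first whitespace-separated field of every non-blank line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # a blank line splits into an empty list, so skip it
            counts[fields[0]] += 1
    return dict(counts)


# Any iterable of lines works, including an open file object.
result = count_first_fields(io.StringIO(SAMPLE_LOG))
print(result)
# => {'192.168.10.20': 1, '10.30.24.3': 1, '98.5.45.3': 2}
```

With a real file, the call would presumably be `count_first_fields(fr1)` inside the `with open(...)` block.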

To get only the IP at the start of each line with a regex, you may use:

r'(?m)^\d{1,3}(?:\.\d{1,3}){3}'

See the regex demo.

Note that a stricter IP regex (see this reference), which limits each octet to 0-255 and matches at the start of a line when used with `re.M` or a leading `(?m)`, is:

r'^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}'

Or even this one, which also requires the IP to be followed by whitespace or the end of the string, since you have a space after each IP:

r'^(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}(?!\S)'
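To make the difference between the simple and the stricter patterns concrete, here is a small sketch. The names `LOOSE`, `OCTET`, `STRICT` and the test strings are made up for illustration; `OCTET` is just the repeated octet alternation from the pattern above factored into a variable.

```python
import re

# Simple pattern: any four groups of 1 to 3 digits separated by dots.
LOOSE = re.compile(r'^\d{1,3}(?:\.\d{1,3}){3}')

# Stricter pattern from above, with the repeated octet part factored out.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
STRICT = re.compile(r'^' + OCTET + r'(?:\.' + OCTET + r'){3}(?!\S)')

# Hypothetical line prefixes: a valid IP, out-of-range octets, trailing junk.
candidates = ['192.168.10.20 - -', '999.300.1.1 - -', '10.30.24.3x - -']

loose_hits = [bool(LOOSE.match(text)) for text in candidates]
strict_hits = [bool(STRICT.match(text)) for text in candidates]

print(loose_hits)   # => [True, True, True]
print(strict_hits)  # => [True, False, False]
```

The simple pattern accepts all three prefixes, while the stricter one rejects octets above 255 and, thanks to `(?!\S)`, an IP that is immediately followed by a non-whitespace character.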

Details of the simple pattern

  • (?m)^ - start of a line
  • \d{1,3} - 1 to 3 digits
  • (?:\.\d{1,3}){3} - three occurrences of . and 1 to 3 digits.

See the Python demo:

import re
line1=r'''192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042 "http://cooper.com/homepage/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5342 (KHTML, like Gecko) Chrome/14.0.870.0 Safari/5342"
10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939 "http://www.cole-brown.net/category/main/list/privacy/" "Mozilla/5.0 (X11; Linux i686) AppleWebKit/5322 (KHTML, like Gecko) Chrome/14.0.843.0 Safari/5322"
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958 "http://knight-chase.com/post.jsp" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3; rv:1.9.6.20) Gecko/2013-11-03 17:44:01 Firefox/3.8"'''

rx = r"^\d{1,3}(?:\.\d{1,3}){3}\b"
listofip = re.findall(rx, line1, re.M)
result ={}
for ip in listofip:
    if ip in result:
        result[ip] += 1
    else:
        result[ip] = 1
print(result)
# => {'192.168.10.20': 1, '10.30.24.3': 1, '98.5.45.3': 1} 
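
If only the first IP comes back when running a pattern like this on text read with `.read()`, the multiline flag is a likely thing to check: without it, `^` matches only at the very start of the string. A minimal sketch of the difference, using a shortened, hypothetical three-line string in place of the real log:

```python
import re

# Shortened, hypothetical stand-in for the contents of the log file.
text = "192.168.10.20 - - first\n10.30.24.3 - - second\n98.5.45.3 - - third"
rx = r"^\d{1,3}(?:\.\d{1,3}){3}\b"

# Without re.M, ^ anchors only at the start of the whole string.
without_flag = re.findall(rx, text)

# With re.M (or an inline (?m) at the start of the pattern), ^ also
# matches right after every newline.
with_flag = re.findall(rx, text, re.M)
inline_flag = re.findall(r"(?m)" + rx, text)

print(without_flag)  # => ['192.168.10.20']
print(with_flag)     # => ['192.168.10.20', '10.30.24.3', '98.5.45.3']
print(inline_flag)   # => ['192.168.10.20', '10.30.24.3', '98.5.45.3']
```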
Wiktor Stribiżew
  • When I read from the file, I get only the first line: {'192.168.10.20': 1} –  Aug 16 '19 at 08:59
  • @sim Are you reading the file in using `with open (r'C:\Users\ubuntu\Desktop\Tests\apache.log', 'r') as fr1: line1=fr1.read()`? Then the second solution has to work for you. BUT make sure `line1` is in scope. – Wiktor Stribiżew Aug 16 '19 at 09:00
  • Printing `line1` shows all the lines, but I am still getting only the first line. –  Aug 16 '19 at 11:07
  • @sim Are you sure you are using my code? Did you use `re.M` if you are using a regex approach? – Wiktor Stribiżew Aug 16 '19 at 11:19

Your log file is effectively a space-delimited CSV file, and the IP address is in the first column, so there is no need for a regex here.

import csv

with open('apache.log', encoding='utf8') as logfile:
    reader = csv.reader(logfile, delimiter=' ')

    for row in reader:
        print(row[0])

outputs

192.168.10.20
10.30.24.3
98.5.45.3
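
To get per-IP counts rather than a printed list, the rows can be fed into `collections.Counter`. The sketch below is an assumption-laden illustration: `SAMPLE_LOG` is a hypothetical, shortened copy of the log, and `io.StringIO` stands in for the opened file. It also shows that `csv.reader` keeps a double-quoted field such as the request line together as one column.

```python
import csv
import io
from collections import Counter

# Hypothetical, shortened stand-in for apache.log.
SAMPLE_LOG = '''192.168.10.20 - - [18/Jul/2017:08:41:37 +0000] "PUT /search/tag/list HTTP/1.0" 200 5042
10.30.24.3 - - [18/Jul/2017:08:45:15 +0000] "POST /search/tag/list HTTP/1.0" 200 4939
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958
98.5.45.3 - - [18/Jul/2017:08:45:49 +0000] "GET /apps/cart.jsp?appID=8471 HTTP/1.0" 200 4958
'''

# Read all rows; csv.reader yields an empty list for a blank line, so drop those.
rows = [row for row in csv.reader(io.StringIO(SAMPLE_LOG), delimiter=' ') if row]

# Column 0 is the client IP.
counts = dict(Counter(row[0] for row in rows))
print(counts)
# => {'192.168.10.20': 1, '10.30.24.3': 1, '98.5.45.3': 2}

# The quoted request is a single column; the bracketed timestamp is not quoted,
# so it is split into two columns (indexes 3 and 4).
print(rows[0][5])
# => PUT /search/tag/list HTTP/1.0
```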
Tomalak
  • reader = csv.reader(logfile, delimiter=' ')
    resultcsv = {}
    for row in reader:
        #print(row[0])
        if row[0] in resultcsv:
            resultcsv[row[0]] += 1
        else:
            resultcsv[row[0]] = 1
    resultcsv –  Aug 16 '19 at 09:16