Get fourth to last line where a string occurs in a file

Question

I am currently searching through a log file that contains IP addresses.
Log example:

10.1.177.198 Tue Jun 19 09:25:16 CDT 2018
10.1.160.198 Tue Jun 19 09:25:38 CDT 2018
10.1.177.198 Tue Jun 19 09:25:36 CDT 2018
10.1.160.198 Tue Jun 19 09:25:40 CDT 2018
10.1.177.198 Tue Jun 19 09:26:38 CDT 2018
10.1.177.198 Tue Jun 19 09:27:16 CDT 2018
10.1.177.198 Tue Jun 19 09:28:38 CDT 2018

I can currently grab the IP address from the last line of the log. I can also search for all line numbers that have the same IP address.

If the last IP address in the log is listed 3 or more times in the log, how can I get the line number for the 3rd to last occurrence of that IP address?

For example, I want to get the line number for this line:

10.1.177.198 Tue Jun 19 09:26:38 CDT 2018

Or better yet, just print the entire line.

Here is an example of my code:

import re

def run():

    try:
        logfile = open('read.log', 'r')

        for line in logfile:  
            x1 = line.split()[0]
            for num, line in enumerate(logfile, 0):
                if x1 in line:
                    print("Found " + x1 + " at line:", num)

        print ('Last Line: ' + x1)

        logfile.close
    except OSError as e:
        print (e)

run()

I am listing all the line numbers where the specific IP address occurs.

print("Found " + x1 + " at line:", num)

I want to print the line where "num" is the 3rd to last number in that list of line numbers.

My overall goal is to grab the IP address from the last line in the log file, then check whether it has previously been listed more than 3 times. If it has, I want to find the 3rd to last listing of the address and get its line number (or just print the address and date listed on that line).

What is the expected output? Only one IP address, or all IP addresses? – Venkata Gogu, Jun 19 '18 at 18:42
What if two IP addresses occur more than 3 times? Do you need the 3rd occurrence from the last for both? – Venkata Gogu, Jun 19 '18 at 18:45
I just want to check if the last IP address in the log occurs more than 3 times (or whatever amount I specify). – Kade Williams, Jun 19 '18 at 18:48
You can use a dictionary (key: IP address, value: list of occurrences) to track all line numbers and print the 3rd value from the back of the list. – Venkata Gogu, Jun 19 '18 at 18:52

Venkata Gogu · Accepted Answer · 2018-06-20T10:43:49.280

Track all the occurrences and print the 3rd one from the last. This can be optimized by using heapq.

def run():
    try:
        # Map each IP address to a list of (line number, time) tuples.
        ip_address_line_number = dict()
        x1 = None
        # The with block closes the file even if parsing raises.
        with open('log.txt', 'r') as logfile:
            for index, line in enumerate(logfile, 1):
                fields = line.split()
                x1 = fields[0]
                log_time = fields[4]
                if x1 in ip_address_line_number:
                    ip_address_line_number[x1].append((index, log_time))
                else:
                    ip_address_line_number[x1] = [(index, log_time)]

        # After the loop, x1 holds the IP address from the last line.
        if x1 in ip_address_line_number and len(ip_address_line_number[x1]) > 2:
            print('Third to last: ' + str(ip_address_line_number[x1][-3]))
        else:
            print(str(x1) + ' has 0-2 occurrences')
    except OSError as e:
        print(e)

run()
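
The heapq optimization mentioned above is not shown in the answer. As a rough alternative sketch (a swapped-in technique, not the heapq idea itself), a collections.deque with maxlen=n keeps only the n most recent occurrences per address, so memory per address stays bounded. The function name nth_from_last and its parameters are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque


def nth_from_last(lines, n=3):
    """Return (line_number, line) for the n-th to last line that starts with
    the same first field as the final line, or None if there are fewer than
    n such lines."""
    # Each deque silently discards its oldest entry once it holds n items.
    recent = defaultdict(lambda: deque(maxlen=n))
    last_ip = None
    for number, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        last_ip = fields[0]
        recent[last_ip].append((number, line.rstrip("\n")))
    if last_ip is None or len(recent[last_ip]) < n:
        return None
    # The oldest entry still kept is the n-th occurrence from the end.
    return recent[last_ip][0]
```

Since an open file is an iterable of lines, something like nth_from_last(open('log.txt')) should work as well, ideally inside a with block.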

edited Jun 20 '18 at 10:43

answered Jun 19 '18 at 18:57

You only need to find the 3rd from last, so there's no need to store every single entry you find. Tracking every single occurrence in a long log file could become very expensive, in terms of memory usage and time. – Thomas Cohn Jun 19 '18 at 19:15
Yes, that is why I said we can optimize it using `heapq of size 3`. Also, the OP mentions that he can give any IP address to look up. My code provides the additional functionality of looking up the 3rd occurrence from the last for any of the IP addresses. – Venkata Gogu Jun 19 '18 at 19:17
I suggest reading the first comment on the first answer. At least I am storing indices only, which won't take a lot of space. – Venkata Gogu Jun 19 '18 at 19:22
This works for me. What if I also want to print the time on the line? I could make 'x2 = line.split()[4]'. How would I include that in the print line? – Kade Williams Jun 19 '18 at 19:33
I have edited my answer to include that, like this: `log_time = line.split()[4]` – Venkata Gogu Jun 19 '18 at 19:43
It looks like you don't need the `re` module. – pylang Jun 19 '18 at 23:57

pylang · Answer 2 · 2018-06-19T23:57:34.540

Another way to look at this, if the file were read in reverse:

What is the line data for the third observation of the first ip?
In the file, there must be at least 3+1 observations of the first ip.

There are many tools that can offer even simpler code, but here is one flexible, general approach geared for memory efficiency. Roughly, let's:

read the file backwards
count up to 3+1 observations
return the last observation

Given

A file test.log

# test.log 
10.1.177.198 Tue Jun 19 09:25:16 CDT 2018
10.1.160.198 Tue Jun 19 09:25:38 CDT 2018
10.1.177.198 Tue Jun 19 09:25:36 CDT 2018
10.1.160.198 Tue Jun 19 09:25:40 CDT 2018
10.1.177.198 Tue Jun 19 09:26:38 CDT 2018
10.1.177.198 Tue Jun 19 09:27:16 CDT 2018
10.1.177.198 Tue Jun 19 09:28:38 CDT 2018

and code for a reverse_readline() generator, we can write the following:

Code

def run(filename, target=3, min_=3):
    """Return the line number and data of the `target`-last observation.

    Parameters
    ----------
    filename : str or Path
        Filepath or name to file.
    target : int
        Number of final expected observations from the bottom, 
        e.g. "third to last observation." 
    min_ : int
        Total observations must exceed this number.

    """
    idx, prior, data = 0, "", []    
    for i, line  in enumerate(reverse_readline(filename)):
        ip, text = line.strip().split(maxsplit=1)
        if i == 0:
            target_ip = ip
        if target == 0:
            idx, *data = prior
        if ip == target_ip:
            target -= 1                                      
            prior = i, ip, text

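    # Edge case: `prior[0]` is the reversed index of the earliest observation,
    # i.e. an upper bound on the observation count rather than the count itself.
    total_obs = prior[0]
    if total_obs < min_:
        print(f"Minimum observations was not met.  Got {total_obs} observations.")
        return None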

    # Compute line number
    line_num = (i - idx) + 1                               # add 1 (zero-indexed)
    return  [line_num] + data

Demo

run("test.log")
# [5, '10.1.177.198', 'Tue Jun 19 09:26:38 CDT 2018']

Second to last observation:

run("test.log", 2)
# [6, '10.1.177.198', 'Tue Jun 19 09:27:16 CDT 2018']

Minimum required observations:

run("test.log", 2, 7)
# Minimum observations was not met.  Got 6 observations.

Add error handling as needed.

Details

Note: an "observation" is a line containing the targeted ip.

We iterate the memory efficient reverse_readline() generator.
The target_ip is determined from the "first" line of the reversed file.
We are only interested in the third observation, so we need not save all information. Thus as we iterate, we only temporarily save one observation at a time to prior (reducing memory consumption).
target is a counter that is decremented after each observation. When the target counter reaches 0, the prior observation is saved until the generator is exhausted.
prior is a tuple containing line data for the last observation of the target ip address, i.e. index, address and text.
The generator is exhausted to determine total_obs and the length of the file, which is used to compute line_num.
The computed line number and line data are returned.
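
The reverse_readline() generator is referenced above but not included in this answer. The following is only a minimal sketch of what such a generator might look like, assuming a UTF-8 text file with "\n" or "\r\n" line endings. The buf_size and encoding parameters are assumptions for illustration, and the implementation the answer originally relied on may differ.

```python
import os


def reverse_readline(filename, buf_size=8192, encoding="utf-8"):
    """Yield the lines of a text file from last to first, without newlines."""
    with open(filename, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        tail = None  # partial line carried over from the chunk read before
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            fh.seek(position)
            chunk = fh.read(read_size)
            if tail is None:
                # First chunk read (end of file): ignore one trailing newline.
                if chunk.endswith(b"\n"):
                    chunk = chunk[:-1]
            else:
                chunk += tail
            lines = chunk.split(b"\n")
            # The first piece may continue into the preceding chunk, so hold it.
            tail = lines.pop(0)
            for line in reversed(lines):
                yield line.decode(encoding).rstrip("\r")
        if tail is not None:
            yield tail.decode(encoding).rstrip("\r")
```

Reading fixed-size chunks from the end keeps memory use proportional to buf_size plus the longest line, rather than to the size of the file.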

SpghttCd · Answer 3 · 2018-06-20T05:17:00.540

Using pandas this would be quite short:

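import pandas as pd

# Fixed-width parse: characters 0-11 hold the IP, the rest holds the timestamp.
# This assumes every IP address in the log is exactly 12 characters wide.
df = pd.read_fwf('read.log', colspecs=[(None, 12), (13, None)], header=None, names=['IP', 'time'])

lastIP = df.IP.iloc[-1]                        # IP address on the last line
lastIP_idx = df.groupby('IP').groups[lastIP]   # row labels of all lines with that IP

n = 3
if len(lastIP_idx) >= n:
    print('\t'.join(list( df.loc[lastIP_idx[-n]] )))
else:
    print('occurrence number of ' + lastIP + ' < ' + str(n))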

edited Jun 20 '18 at 05:17

answered Jun 20 '18 at 05:00
