
I have a log file with the following lines:

21:32:47 daemon DENIED: "Prog1" usera server82 (Licensed number of users already reached.)
21:32:48 daemon DENIED: "Prog1" usera server82 (Licensed number of users already reached.)
21:32:51 daemon DENIED: "Prog1" usera server39 (Licensed number of users already reached.)
21:58:38 daemon DENIED: "Prog2" userb server97 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog2" userb server97 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog3" userb server97 (User/host not on INCLUDE list for feature.)
21:58:40 daemon DENIED: "Prog2" userd server04 (User/host not on INCLUDE list for feature.)
22:35:59 daemon DENIED: "Prog2" userd server92 (User/host not on INCLUDE list for feature.)

What I would like to do is filter it and show only the non-duplicate lines, keeping the one with the most recent time when lines otherwise repeat.
So the result should look like this:

21:32:48 daemon DENIED: "Prog1" usera server82 (Licensed number of users already reached.)
21:32:51 daemon DENIED: "Prog1" usera server39 (Licensed number of users already reached.)
21:58:38 daemon DENIED: "Prog2" userb server97 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog3" userb server97 (User/host not on INCLUDE list for feature.)
21:58:40 daemon DENIED: "Prog2" userd server04 (User/host not on INCLUDE list for feature.)
22:35:59 daemon DENIED: "Prog2" userd server92 (User/host not on INCLUDE list for feature.)

As you may notice, some lines share the same user or program, but taken as a whole the lines are not identical, because the server or the time differs.

  • You want to keep the most recent copy of every line that is a duplicate of another (apart from the time)? – Etan Reisner Dec 03 '14 at 16:07
  • possible duplicate of [How to list the last occurance of a specific string in Terminal](http://stackoverflow.com/questions/27230743/how-to-list-the-last-occurance-of-a-specific-string-in-terminal). The same principle applies – fredtantini Dec 03 '14 at 16:10

3 Answers


I take advantage of the uniqueness of array keys. The variant part is the timestamp, so I store it as the array value, with the rest of the line (without the timestamp) as the array key.

$ awk '
    {hour=$1;$1="";arr[$0]=hour}
    END{for (a in arr) {print arr[a] a}}
' file.txt

Output:

21:32:48 daemon DENIED: "Prog1" usera server82 (Licensed number of users already reached.)
21:58:40 daemon DENIED: "Prog2" userd server04 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog3" userb server97 (User/host not on INCLUDE list for feature.)
22:35:59 daemon DENIED: "Prog2" userd server92 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog2" userb server97 (User/host not on INCLUDE list for feature.)
21:32:51 daemon DENIED: "Prog1" usera server39 (Licensed number of users already reached.)
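
Since awk's for (a in arr) loop visits the keys in no particular order, the result above comes out unsorted. A minimal tweak (assuming all timestamps share the HH:MM:SS format, so lexical order equals time order) is to pipe the result through sort:

$ awk '
    {hour=$1;$1="";arr[$0]=hour}
    END{for (a in arr) {print arr[a] a}}
' file.txt | sort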
Gilles Quénot
sort -r file.txt | uniq -f 1 | tac
  1. sort -r: sort the lines in reverse order by timestamp.
  2. uniq -f 1: ignoring the timestamp, remove duplicate lines, leaving only the first-encountered occurrence of each. Since we sorted in reverse, that will be the most recent one (see the small illustration just after this list).
  3. tac: reverse the order of the lines, thus putting it back into forward order by timestamp.
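
For instance, here is a small illustration of the uniq -f 1 behaviour, using simplified lines that differ only in the first field:

$ printf '21:32:47 daemon DENIED: Prog1 usera server82\n21:32:48 daemon DENIED: Prog1 usera server82\n' | sort -r | uniq -f 1
21:32:48 daemon DENIED: Prog1 usera server82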

Here's the output on your sample data:

21:32:48 daemon DENIED: "Prog1" usera server82 (Licensed number of users already reached.)
21:32:51 daemon DENIED: "Prog1" usera server39 (Licensed number of users already reached.)
21:58:38 daemon DENIED: "Prog2" userb server97 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog3" userb server97 (User/host not on INCLUDE list for feature.)
21:58:40 daemon DENIED: "Prog2" userd server04 (User/host not on INCLUDE list for feature.)
22:35:59 daemon DENIED: "Prog2" userd server92 (User/host not on INCLUDE list for feature.)

You tagged this question Linux, so I used the GNU tac utility; if you were on a Mac or BSD system, you could use tail -r instead.
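
For example, a BSD/macOS sketch of the same pipeline, with tail -r in place of tac:

sort -r file.txt | uniq -f 1 | tail -r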

Mark Reed
  • Unfortunately your solution is not working for me. I tried sort -r | uniq -f 1 | tail -r and the results weren't filtered – Fonten Dec 04 '14 at 07:48
  • I thought you said you were on Linux - why `tail -r` instead of `tac`? And what do you mean "weren't filtered"? – Mark Reed Dec 04 '14 at 11:27

TAKING THE LOG FILE, ELIMINATING ALL DUPLICATES, THEN SORTING WHATEVER REMAINS IN TIME ORDER

Here is a script you can use; below I will discuss what each command is doing.

SCRIPT


#!/bin/bash
# $1 = old log file (re-sorted in place), $2 = new log file for the de-duplicated output
rm -f "$2" 2> /dev/null
touch "$2"
cat "$1" > tmp
sort -r tmp > "$1"
rm -f tmp 2> /dev/null
while read -r line; do
    # drop the timestamp (first field) before looking for duplicates
    line_to_find=`echo "$line" | cut -d ' ' -f2-`
    no_of_duplicated_lines=`grep "$line_to_find" "$1" | wc -l`
    if [ "$no_of_duplicated_lines" -ne 1 ]; then
        # duplicates exist: copy the line only if it is not already in the new log
        matching_line_in_log_files=`grep "$line_to_find" "$2"`
        if [ -z "$matching_line_in_log_files" ]; then
            echo "$line" >> "$2"
        fi
    else
        echo "$line" >> "$2"
    fi
done < "$1"
cat "$2" > tmp
sort -r tmp > "$2"
rm -f tmp 2> /dev/null

HOW THE SCRIPT WORKS

my_script <log_file> <new_log_file>

"path to script" "path to log file to be modified" "a new log file location"

[STEP 1] REMOVE ANY EXISTING FILE THAT HAS THE NEW FILENAME I WANT TO CREATE

rm -f "$2" 2> /dev/null

[STEP 2] CREATE A NEW LOG FILE FOR THE DE-DUPLICATED OUTPUT AND SORT THE OLD LOG FILE BY MOST RECENT TIME

Redirect the old log file to a temporary file, sort all messages by time, then redirect the sorted messages back to the old log file.

cat "$1" > tmp
sort -r tmp > "$1"
rm -f tmp 2> /dev/null 

[STEP 3] READ EACH LINE OF THE OLD LOG FILE

This is done with a while loop that reads each line until the end of the file.

while read -r line; do
............
............
done < "$1"

[STEP 4] STRIP THE TIMESTAMP FROM EACH LINE AND SAVE THE REST TO A VARIABLE

The messages were already sorted by most recent timestamp in step 2.

line_to_find=`echo "$line" | cut -d ' ' -f2-`

[STEP 5] COUNT THE LINES IN THE OLD LOG THAT MATCH THE VARIABLE WITHOUT THE TIMESTAMP

I first count the duplicated lines in the old log. If duplicates are found, I check the new log file to see whether the line has already been redirected to it.

no_of_duplicated_lines=`grep "$line_to_find" "$1" | wc -l`
if [ "$no_of_duplicated_lines" -ne 1 ]; then
    matching_line_in_log_files=`grep "$line_to_find" "$2"`
    if [ -z "$matching_line_in_log_files" ]; then
        echo "$line" >> "$2"
    fi
else

[STEP 6] IF THE NUMBER OF MATCHING LINES IS 1, REDIRECT THE LINE TO THE NEW LOG FILE

If only one line is found in the old log, it is not a duplicate, so just redirect it from the old log to the new log.

echo "$line" >> "$2"
fi

[STEP 7] SORT THE NEW LOG FILE IN TIMESTAMP ORDER, MOST RECENT TO OLDEST

Open the new log, sort it by most recent time, and redirect to a tmp file. Afterwards, redirect the sorted output from the tmp file back to the new log and remove the tmp file once the process has finished.

cat "$2" > tmp
sort -r tmp > "$2"
rm -f tmp 2> /dev/null

YOUR OUTPUT, SORTED BY TIME (MOST RECENT FIRST)

22:35:59 daemon DENIED: "Prog2" userd server92 (User/host not on INCLUDE list for feature.)
21:58:40 daemon DENIED: "Prog2" userd server04 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog3" userb server97 (User/host not on INCLUDE list for feature.)
21:58:38 daemon DENIED: "Prog2" userb server97 (User/host not on INCLUDE list for feature.)
21:32:51 daemon DENIED: "Prog1" usera server39 (Licensed number of users already reached.)
21:32:48 daemon DENIED: "Prog1" usera server82 (Licensed number of users already reached.)
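
If you would rather have the output oldest-first, as shown in the question, one option is a plain sort on the new log afterwards (a minimal sketch, using the illustrative file name from the usage example above):

sort daemon_filtered.log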
repzero
  • I know the script is a little longer than the other answers, but it is a good approach to understanding the process – repzero Dec 04 '14 at 04:46
  • Note also I added some edits to my previous answer so that the code gives output in recent-timestamp order – repzero Dec 04 '14 at 05:07
  • Thanks @Xorg, this one also gives me the result I want, but as you mentioned it is a bit long. Despite that, it is a nice approach. – Fonten Dec 04 '14 at 07:35