
I'm trying to filter out all duplicates of a list, ignoring the first n columns, preferably using awk (but I'm open to other implementations).

I've found a solution for a fixed number of columns, but as I don't know how many columns there will be, I need a range. I found that solution here.

For clarity: what I'm trying to achieve is an alias for history that filters out duplicates but leaves the history_id intact, preferably without messing with the order. The history is in this form:

ID    DATE       HOUR     command
 5612  2019-07-25 11:58:30 ls /var/log/schaubroeck/audit/2019/May/
 5613  2019-07-25 12:00:22 ls /var/log/schaubroeck/         
 5614  2019-07-25 12:11:30 ls /etc/logrotate.d/                       
 5615  2019-07-25 12:11:35 cat /etc/logrotate.d/samba     
 5616  2019-07-25 12:11:49 cat /etc/logrotate.d/named 

So this command works for commands up to four arguments long, but I need to replace the fixed columns by a range to account for all cases:

history | awk -F "[ ]" '!keep[$4 $5 $6 $7]++'

I feel @kvantour is getting me on the right path, so I tried:

history | awk '{t=$0;$1=$2=$3=$4="";k=$0;$0=t}_[k]++' | grep cd

But this still yields duplicate lines:

 1102  2017-10-27 09:05:07 cd /tmp/
 1109  2017-10-27 09:07:03 cd /tmp/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1127  2017-12-29 11:13:26 cd /tmp/
 1144  2018-06-21 13:04:26 cd /etc/init.d/
 1161  2018-06-28 09:53:21 cd /etc/init.d/
 1169  2018-07-09 16:33:52 cd /var/log/
 1179  2018-07-10 15:54:32 cd /etc/init.d/
oneindelijk
  • Using `_` as a variable name rather than a meaningful word or even a letter does nothing to help the clarity of any program. Also `-F" "` is setting FS to the default value it was already set to. [edit] your question to provide concise, testable sample input and expected output so we can help you. – Ed Morton Oct 01 '19 at 15:56
  • If a single space is requested as field separator, use `-F"[ ]"`. The default value that @EdMorton mentions is any sequence of spaces and tabs. – kvantour Oct 01 '19 at 16:02
  • You can use `uniq -f` instead of awk. – user448810 Oct 01 '19 at 18:56
  • @EdMorton I agree about the `_`. I was just copying this from the linked question. – oneindelijk Oct 02 '19 at 06:59
  • @user448810 seems to be not working. I tried `history | uniq -f 3 | grep cd` (I also tried with 4 since there seems to be a space at the beginning and maybe the 1st column is counted as the 2nd) – oneindelijk Oct 02 '19 at 07:05
  • you said `!keep[$4 $5 $6 $7]++` worked and @kvantour showed `{...}!_[k]++` but you tried `{...}_[k]++` instead (you dropped the `!`). – Ed Morton Oct 02 '19 at 15:05
  • @EdMorton Well spotted! I made the same typo in my terminal! I tried kvantour's solution after I marked Chris Maes' answer as my solution (copying it from the thread) and I didn't understand why it suddenly worked. Thanks for pointing that out, now I can continue on my awk journey with renewed confidence... – oneindelijk Oct 03 '19 at 06:46
  • I forgot to mention that some of our servers have a history of over 50,000 lines. How would I grade the performance between these two solutions? – oneindelijk Oct 03 '19 at 08:33
  • With that small an input file (50k lines) any reasonable solution should work in the blink of an eye so relative performance won't matter unless you're repeating the command thousands of times in a row. – Ed Morton Oct 03 '19 at 13:36
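To put the performance question in perspective, a rough benchmark can be run on generated data. This is only a sketch: the file path and the synthetic data (50,000 lines with 1,000 distinct commands) are made up for the demo.

```shell
# Generate ~50,000 history-like lines with 1,000 distinct commands
# (file path and data are illustrative only).
awk 'BEGIN{for(i=1;i<=50000;i++) printf " %d  2019-01-01 00:00:00 cmd%d\n", i, i%1000}' > /tmp/hist.txt

# Time the awk approach: dedupe on everything after the first 3 fields.
time awk '{t=$0;$1=$2=$3="";k=$0;$0=t}!_[k]++' /tmp/hist.txt > /dev/null

# Time the double-sort approach: dedupe on fields 4+, then restore ID order.
time (sort -u -k4 /tmp/hist.txt | sort -n) > /dev/null
```

Both pipelines should reduce the 50,000 input lines to 1,000 unique commands; at this size either finishes near-instantly, as noted above.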

2 Answers


The command you propose will not work as you expect. Imagine you have two lines like:

a b c d 12 13 1
x y z d 1 21 31

Both lines will be considered duplicates, because the key stored in the array _ is d12131 for both.

This is probably what you are interested in:

$ history | awk '{t=$0;$1=$2=$3="";k=$0;$0=t}!_[k]++'

Here we store the original record in the variable t, then remove the first three fields of the record by assigning empty values to them. This rebuilds the record $0, which we store in the key k. Then we restore the record from t. The duplicate check uses the key k, which now holds all fields except the first three.
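A quick way to verify the behaviour is to run the same one-liner on a few inline sample lines in place of `history`:

```shell
# The t/k trick on inline sample data (a stand-in for `history` output):
# lines 1102 and 1109 share the same command, so only the first survives.
printf '%s\n' \
  ' 1102  2017-10-27 09:05:07 cd /tmp/' \
  ' 1109  2017-10-27 09:07:03 cd /tmp/' \
  ' 1112  2017-10-27 09:07:15 cd nagent-rhel_64/' |
awk '{t=$0; $1=$2=$3=""; k=$0; $0=t} !_[k]++'
```

This prints the 1102 and 1112 lines and drops 1109, whose key matches that of 1102.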

Note: setting the field separator as -F" " will not set it to a single space, but to any sequence of blanks (spaces and tabs). This is also the default behaviour. If you want a single space, use -F"[ ]".
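The difference is easy to see by counting fields on a line with leading spaces:

```shell
# Default FS: runs of blanks separate fields, and leading blanks are ignored.
echo '  a b' | awk '{print NF}'          # prints 2

# FS set to a literal single space: every space is a separator, so the
# two leading spaces create two empty leading fields.
echo '  a b' | awk -F'[ ]' '{print NF}'  # prints 4
```

This is why the apparent first column of history output (which starts with spaces) can shift position when -F"[ ]" is used.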

kvantour
  • That's a helpful explanation. But it is still not working... I added the command and the results in my question, because the formatting is messed up here in comments – oneindelijk Oct 02 '19 at 07:13
  • @oneindelijk when I execute my line on the input you present I obtain the correct result. So I cannot reproduce your results. Also you set `$4==""` while this should not be the case. – kvantour Oct 02 '19 at 07:59
  • I've tried with both $4== and without. Maybe the space at the beginning, makes the apparent 1st column actually the 2nd... – oneindelijk Oct 02 '19 at 08:26
  • This is redhat 6.10, maybe ? – oneindelijk Oct 02 '19 at 08:26
  • Although I'm going for the `sort -u -k4 | sort -n` option, I'm marking this answer as the solution, because I was (am) looking to learn more about awk. Thanks ! – oneindelijk Oct 02 '19 at 08:33
  • @oneindelijk If this answer did not help you out, you should accept the answer that did. An upvote is always welcome. – kvantour Oct 02 '19 at 09:39

You can use sort:

history | sort -u -k4
  • -u for unique
  • -k4 to sort on all columns starting from the fourth.

Running this on

 1102  2017-10-27 09:05:07 cd /tmp/
 1109  2017-10-27 09:07:03 cd /tmp/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1127  2017-12-29 11:13:26 cd /tmp/
 1144  2018-06-21 13:04:26 cd /etc/init.d/
 1161  2018-06-28 09:53:21 cd /etc/init.d/
 1169  2018-07-09 16:33:52 cd /var/log/
 1179  2018-07-10 15:54:32 cd /etc/init.d/

yields:

 1124  2017-11-07 16:38:50 cd /etc/init.d/
 1112  2017-10-27 09:07:15 cd nagent-rhel_64/
 1102  2017-10-27 09:05:07 cd /tmp/
 1169  2018-07-09 16:33:52 cd /var/log/

EDIT: if you want to keep the original order, apply a second sort:

history | sort -u -k4 | sort -n
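A quick sanity check on a few sample lines (standing in for the real history output):

```shell
# sort -u -k4 keeps one line per distinct command (fields 4 and up),
# then sort -n restores ascending ID order.
printf '%s\n' \
  ' 1102  2017-10-27 09:05:07 cd /tmp/' \
  ' 1127  2017-12-29 11:13:26 cd /tmp/' \
  ' 1169  2018-07-09 16:33:52 cd /var/log/' |
sort -u -k4 | sort -n
```

Two lines remain: one of the cd /tmp/ duplicates and the cd /var/log/ line. Note that which of the two cd /tmp/ lines survives depends on sort's tie-breaking, so unlike the awk approach this does not guarantee keeping the first occurrence.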
Chris Maes
  • A double sort might be better to keep the time-ordering – kvantour Oct 02 '19 at 07:53
  • Close, but the second sort should use numeric sort, so `history | sort -u -k4 | sort -n` is the working answer I was looking for. Thanks! – oneindelijk Oct 02 '19 at 08:30
  • Of course, silly of me; I tested only on your data, which was all 4 digits long. If this was the answer you were looking for you should accept this answer though. If the other answer helped you, you should upvote it. – Chris Maes Oct 02 '19 at 08:39