remove duplicate lines with similar prefix

Question

I need to remove every line in a file that is a prefix of another line, and keep only the lines that are not.

From this,

abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/
123/456/789/
xyz/

to this

abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/

Appreciate any suggestions,

What did you try? Post your research efforts in the question, even if they are trivial — Inian, Feb 07 '18 at 05:44
You will get a much more friendly reception and much better help here if you show what code you have tried so far and describe what problems you were having with it. Without code, your question looks like a request for free consulting and many people don't like that. — John1024, Feb 07 '18 at 05:46
On top of that ... How do you define a prefix? From what you wrote, line 1,2 and 3 all have the same prefix but your example says not. — kvantour, Feb 07 '18 at 07:06
Has my answer provided you with some insight? Does it work as expected? — Allan, Feb 07 '18 at 07:52
Apologies, I will post the effort next time. I tried to play around with it and couldn't work out how to get rid of the duplicates. Allan's answer solved the problem, and so did kvantour's. Cheers. — armahalma, Feb 07 '18 at 11:26

score 4 · Answer 1 · edited Feb 09 '18 at 18:33

Answer in case reordering the output is allowed.

sort -r file | awk 'a!~"^"$0{a=$0;print}'

sort -r file : sorts the lines in reverse order, so that a longer line is placed before any shorter line that is a prefix of it
awk 'a!~"^"$0{a=$0;print}' : parses the sorted output, where a holds the most recently printed line and $0 holds the current line
- a!~"^"$0 checks, for each line, that the current line does not match at the beginning of a. Note that $0 is used as a regular expression here, so characters such as . in the data are treated as regex operators.
- if $0 is not a prefix of a, we print it and save it in a (to be compared with the following lines)

For the first line, a is still empty, so the test succeeds and the first line is always printed.
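Since the one-liner builds a regular expression from the data, a line such as a.c/ would also match abc/. A minimal sketch of the same idea with a literal comparison through awk's index() follows. The function name dedup_prefixes is hypothetical, and LC_ALL=C is an assumption that forces byte-order sorting so that a prefix always lands directly after the lines that extend it.

```shell
#!/bin/bash
# dedup_prefixes (hypothetical name): drop every line of file "$1" that is
# a literal prefix of another line. Output is in reverse-sorted order.
dedup_prefixes() {
  # LC_ALL=C is an assumption: plain byte-order collation.
  LC_ALL=C sort -r -- "$1" |
    awk 'NR == 1 || index(prev, $0) != 1 { print }   # not at position 1 of prev: keep
         { prev = $0 }                               # remember the previous line'
}

# usage: dedup_prefixes file
```

Exact duplicate lines are collapsed to a single copy as a side effect, because a line counts as a prefix of itself.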

edited Feb 09 '18 at 18:33

kvantour

answered Feb 07 '18 at 19:54

shaiki siegal

Very nice solution. Does this also work if `a="/abc/xyz"` and `$0="/xyz"`? – kvantour Feb 07 '18 at 22:31
thanks :). to remove also "post_fix" you can use `rev` command to reverse all chars ,compare and `rev` back :). `rev file | sort -r | awk 'a!~$0{print;a=$0}' | rev` – shaiki siegal Feb 08 '18 at 08:25
No, I meant that you would also match strings not at the beginning. If you want `$0` not to be at the beginning of `a`, it should read `a!~"^"$0` – kvantour Feb 08 '18 at 08:44

score 2 · Accepted Answer · answered Feb 07 '18 at 05:58

A quick and dirty way of doing it is the following:

$ while IFS= read -r elem; do printf '%s ' "$elem"; grep -F -- "$elem" file | wc -l; done <file | awk '$2==1{print $1}'
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/

This reads the input file and, for each line, prints the line followed by the number of lines in the file that contain it; awk then prints only the lines whose count is 1. Note that grep matches the text anywhere in a line, not only at the beginning.

Bach Lien · Answer 3 · 2018-02-07T18:08:16.257

Step 1: This solution assumes that reordering the output is allowed. If so, it should be faster to reverse sort the input file before processing. After reverse sorting, we only need to compare two consecutive lines in each iteration; there is no need to search the whole file or all the "known prefixes". I take a line to be a prefix, and therefore removable, if it is a prefix of any other line. Here is an example that removes prefixes from a file when reordering is allowed:

#!/bin/bash

f=sample.txt                                 # sample data

p=''                                         # previous line = empty

sort -r "$f" | \
  while IFS= read -r s || [[ -n "$s" ]]; do  # reverse sort, then read string (line)
    [[ "$s" = "${p:0:${#s}}" ]] || \
      printf "%s\n" "$s"                     # if s is not prefix of p, then print it
    p="$s"
  done

Explanation: ${p:0:${#s}} takes the first ${#s} (length of s) characters of string p.
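The substring comparison can be tried in isolation. The sketch below wraps it in a helper named is_prefix, which is a hypothetical name and not part of the script above.

```shell
#!/bin/bash
# is_prefix (hypothetical helper): succeeds when $1 is a literal prefix of $2.
is_prefix() {
  local s=$1 p=$2
  # ${p:0:${#s}} is the first ${#s} characters of p; the quoted right-hand
  # side makes [[ ]] compare literally instead of as a glob pattern.
  [[ "$s" == "${p:0:${#s}}" ]]
}

is_prefix 'abc/def/' 'abc/def/ghi/' && echo "prefix"         # prints "prefix"
is_prefix 'abc/xyz/' 'abc/def/ghi/' || echo "not a prefix"   # prints "not a prefix"
```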

Test:

$ cat sample.txt 
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/
123/456/789/
xyz/

$ ./remove-prefix.sh 
xyz/
abc/def/ghi/jkl/two/two
abc/def/ghi/jkl/one/one
123/456/789/

Step 2: If you really need to keep the original order, this script is an example of removing all prefixes without reordering:

#!/bin/bash

f=sample.txt
p=''

cat -n "$f" | \
  sed 's:\t:|:' | \
  sort -r -t'|' -k2 | \
  while IFS='|' read -r i s || [[ -n "$s" ]]; do
    [[ "$s" = "${p:0:${#s}}" ]] || printf "%s|%s\n" "$i" "$s"
    p="$s"
  done | \
  sort -n -t'|' -k1 | \
  sed 's:^.*|::'

Explanations:

cat -n: numbering all lines
sed 's:\t:|:': use '|' as the delimiter -- you need to change it to another one if needed
sort -r -t'|' -k2: reverse sort with delimiter='|' and use the key 2
while ... done: similar to solution of step 1
sort -n -t'|' -k1: sort back to original order (numbering sort)
sed 's:^.*|::': remove the numbering
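The same number, sort, filter, restore sequence can also be sketched with a tab as the delimiter, which avoids having to pick a character that never occurs in the data. This is only a variant under assumptions: dedup_keep_order is a hypothetical name, LC_ALL=C is assumed for byte-order sorting, and the prefix test uses awk's index() instead of the bash substring comparison.

```shell
#!/bin/bash
# dedup_keep_order (hypothetical name): remove lines of file "$1" that are
# a prefix of another line, keeping the original line order.
dedup_keep_order() {
  local tab
  tab=$(printf '\t')
  awk '{ print NR "\t" $0 }' "$1" |            # number the lines: "N<TAB>line"
    LC_ALL=C sort -r -t "$tab" -k2 |           # reverse sort on the line text
    awk '{ line = substr($0, index($0, "\t") + 1) }    # strip the number
         NR == 1 || index(prev, line) != 1 { print }   # keep non-prefixes
         { prev = line }' |
    sort -n |                                  # back to the original order
    cut -f2-                                   # remove the numbering
}

# usage: dedup_keep_order sample.txt
```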

Test:

$ ./remove-prefix.sh 
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/789/
xyz/

Notes: In both solutions, the most expensive operations are the calls to sort. The solution in step 1 calls sort once, and the solution in step 2 calls sort twice. All other operations (cat, sed, while, string comparison, ...) are much cheaper by comparison.

In the solution of step 2, cat + sed + while + sed is "equivalent" to scanning the file 4 times (which can theoretically be executed in parallel because of the pipes).

kvantour · Answer 4 · 2018-02-07T10:48:50.897

The following awk does what is requested; it reads the file twice.

In the first pass, it builds up all possible prefixes per line.
In the second pass, it checks whether the line is one of those prefixes and prints it if not.

The code is:

awk -F'/' '(NR==FNR){s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]};next}
           {if (! ($0 in a) ) {print $0}}' <file> <file>
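To see what the first pass stores, the prefix-building loop can be run on a single line. In this small sketch, list_prefixes is a hypothetical name and the loop prints each prefix instead of recording it in the array a.

```shell
#!/bin/bash
# list_prefixes (hypothetical name): print the proper prefixes that the
# first pass would record for each '/'-terminated line read from stdin.
list_prefixes() {
  awk -F'/' '{
    s = ""
    # The last field is empty because the line ends in "/", so NF-2
    # stops one path component short of the full line.
    for (i = 1; i <= NF - 2; i++) {
      s = s $i "/"
      print s
    }
  }'
}

printf '%s\n' 'abc/def/ghi/jkl/one/' | list_prefixes
# abc/
# abc/def/
# abc/def/ghi/
# abc/def/ghi/jkl/
```

The NF-2 bound relies on every line ending in /. For a line without the trailing slash, such as abc/def, the loop records nothing, so abc/ would not be recognised as its prefix.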

You can also do it by reading the file a single time, but then the whole file is stored in memory:

awk -F'/' '{s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]}; b[NR]=$0; next}
           END {for(i=1;i<=NR;i++){if (! (b[i] in a) ) {print b[i]}}}' <file>

Similar to the solution of Allan, but using grep -c :

while IFS= read -r line; do (( $(grep -cF -- "$line" <file>) == 1 )) && echo "$line"; done < <file>

Take into account that this construct reads the file (N+1) times, where N is the number of lines.
