
I have a text file whose contents may include duplicates. Below is a simplified representation of the file (each sometext# stands for a unique character, word, or phrase). Note that the separator ---------- may not be present. Also, the whole file consists of Unicode Japanese and Chinese characters.

EDITED

sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10

What I want to achieve is to keep only the line containing the last occurrence of each duplicate, like so:

sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10

The closest I found online is "How to remove only the first occurrence of a line in a file using sed", but that requires knowing in advance which matching pattern(s) to delete. The suggested topics shown while I was writing the title were "Duplicating characters using sed" and "last occurence of date", but neither worked.

I am on a Mac running macOS Sierra. I put my commands in a script.sh file and execute them line by line. I'm using sed and gsed as my primary stream editors.

Char
  • How do you define `duplicates`? – Kent Oct 17 '17 at 12:59
  • Your example is unclear. Please explain how you see your input mapping to that output. – randomir Oct 17 '17 at 13:11
  • Why have ccc and ddd disappeared? – 123 Oct 17 '17 at 13:12
  • @Kent duplicates mean exactly a string of characters. So in my example, eg. cccc might mean `brown fox` (including the whitespace) appearing in various lines in the file. – Char Oct 17 '17 at 16:31
  • @randomir Notice that in my example, the first two `aaaa` would be removed, as I only want to keep the last `aaaa` which appears after 4 `text`. Also, the first `cccc` (the whole line) is removed, because the last `cccc` appears in the line `text:cccc`. – Char Oct 17 '17 at 16:34
  • Why is `bbbb` printed but `----------` not? Why is `text` not de-duplicated? – dawg Oct 18 '17 at 20:53
  • @dawg Yes, you're right. If `----------`, then it should remain. I edited my question to make it clearer hopefully. `bbbb` should appear twice. `text` is changed to `sometext#` to make it clear that they are unique. – Char Oct 20 '17 at 03:03
  • Are you expecting a conversion of `sometextXXX` to `text` as you now show? – dawg Oct 20 '17 at 03:32
  • @dawg No, sometext# remains in the final output, in the same order that it appears. – Char Oct 20 '17 at 04:04

5 Answers


I am not sure if your intent is to preserve the original order of the lines. If that is the case, you could do this:

export LC_ALL=en_US.utf8 # to handle unicode characters in file
nl -n rz -ba file | sort -k2,2 -t$'\t' | uniq -f1 | sort -k1,1 | cut -f2
  • nl -n rz -ba file adds zero padded line numbers to the file
  • sort -k2,2 -t$'\t' sorts the output of nl by the second field (note that nl puts a tab after the line number, and $'\t' is the shell's way of writing a literal tab)
  • uniq -f1 removes the duplicates, while ignoring the line number field (-f1)
  • the final sort restores the original order of the lines, with duplicates removed
  • cut -f2 removes the line number field, restoring the content to the original format
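
For comparison, if the goal is to keep the last copy of each duplicated line while preserving the original order, a two-pass awk is one possible sketch. The function name keep_last and the sample input are made up for illustration, and, like the pipeline above, it only treats identical whole lines as duplicates.

```shell
# keep_last FILE: print FILE with only the last copy of each repeated line.
# (keep_last is a hypothetical helper name, not part of any tool.)
keep_last() {
  # Pass 1 (NR == FNR): remember the last line number of every distinct line.
  # Pass 2: print a line only if it sits at that remembered line number.
  awk 'NR == FNR { last[$0] = FNR; next }
       FNR == last[$0]' "$1" "$1"
}

# Made-up sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' aaaa sometext4 aaaa bbbb bbbb sometext5 > "$sample"
keep_last "$sample"
rm -f "$sample"
```

With the sample input above, this should print sometext4, aaaa, bbbb and sometext5, one per line. The file is read twice, so it has to be a regular file rather than a pipe.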
codeforester
  • I'm not saying this is *wrong* since I think the example given is ambiguous. However, this is substantially different output from the example given... The OP needs to clarify what the desired output is and the reasoning behind it to get a correct answer. – dawg Oct 19 '17 at 21:33
  • And it is also a nice decorate, sort, undecorate pipe btw. – dawg Oct 19 '17 at 21:54
  • @codeforester This is great! Easy to understand from reading the man pages. Adding the leading line numbers is cool. I was able to remove the duplicated lines (ie. exactly the same, eg. only `sometext7:cccc` and `sometext7:cccc`) using this method. But it won't work in this case, ie. Line10 `cccc` and Line21 `sometext7:cccc`, the `cccc` is repeated in part of another line. `cccc` should be removed and `sometext7:cccc` kept. But thanks for pointing out nl, sort and uniq for me! – Char Oct 20 '17 at 04:09
  • Your question doesn't define what exactly is considered a duplicate. My solution assumes it is the whole line. – codeforester Oct 20 '17 at 04:13

This awk is very close.

Given:

$ cat file
sometext1
sometext2
sometext3
aaaa
sometext4
aaaa
aaaa
bbbb
bbbb
cccc
dddd
eeee
ffff
gggg
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10

You can do:

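$ awk 'BEGIN{FS=":"}
        # pass 1: count each ":"-separated field and record the last line it appears on
        FNR==NR {for (i=1; i<=NF; i++) {dup[$i]++; last[$i]=NR;} next}
        # pass 2: skip empty lines
        /^$/ {next}
        # print a line once if any of its fields makes its last appearance here
        {for (i=1; i<=NF; i++) 
            if (dup[$i] && FNR==last[$i]) {print $0; next}}
        ' file file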
sometext1
sometext2
sometext3
sometext4
aaaa
bbbb
----------
sometext5
eeee
ffff
gggg
sometext6
sometext7:cccc
sometext8:dddd
sometext9
sometext10
dawg

This might work for you (GNU sed):

sed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' file

Store the first line in the hold space (HS) and append every subsequent line. Swap to the HS and remove any duplicate line that matches the last line. Also delete any separator lines and then swap back to the pattern space (PS). Delete all but the last line, which is swapped with the HS and printed out.
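
To watch those steps on a tiny input, the same command can be wrapped in a function and fed a few lines through a pipe. This is only a sketch: the name dedupe_keep_last and the sample lines are made up, and it assumes GNU sed (on macOS, typically gsed).

```shell
# dedupe_keep_last [FILE...]: runs the sed command from this answer unchanged.
# Assumes GNU sed is available as "sed"; substitute gsed on macOS.
dedupe_keep_last() {
  sed -r '1h;1!H;x;s/([^\n]+)\n(.*\1)$/\2/;s/\n-+$//;x;$!d;x' "$@"
}

# Made-up sample: the first aaaa and the separator line should be dropped.
printf '%s\n' aaaa sometext4 aaaa ---------- bbbb | dedupe_keep_last
```

For this input the output should be sometext4, aaaa and bbbb, one per line: the earlier aaaa is removed when the second one arrives, and the separator is deleted as soon as it is appended.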

potong

I found a simpler solution, but it sorts the file in the process. If you don't mind sorted output, you can use the following:

$ sort -u input.txt > output.txt

Note: the -u flag makes sort print only one copy of each distinct line.

v.j

From the uniq manual, -d prints one copy of each line that is repeated on adjacent lines:

uniq -d input.txt
Eric Aya