Merge and remove redundant lines among files

Question

I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:

File1.txt

File2.txt

File3.txt

The desired output would be:

I would prefer a solution in awk, bash, or R. I searched the web and, although there were plenty of solutions (some examples are listed below), they all removed duplicated lines regardless of whether the duplicates were within a single file or across files.

Thanks in advance. Arturo

Examples of previous solutions removing redundant lines both within and outside files:
https://unix.stackexchange.com/questions/50103/merge-two-lists-while-removing-duplicates
https://unix.stackexchange.com/questions/457320/combine-text-files-and-delete-duplicate-lines
https://unix.stackexchange.com/questions/350520/awk-combine-two-big-files-and-remove-duplicated-lines
https://unix.stackexchange.com/questions/257467/merging-2-files-and-keeping-the-one-duplicate

`I searched the web for solutions`, request you to please do add them in your question(to avoid close votes and downvotes), as its highly encouraged on SO for original posters to add their efforts in form of code in their questions(not my downvote btw), thank you. — RavinderSingh13, Mar 31 '21 at 12:38
Thank you. I added a few examples of previous solutions that removed redundant lines from both within and outside files. — Arturo, Mar 31 '21 at 12:44
what would you expect if `file3.txt` == `9 9 10 10 12`? do you keep the dual `9's` or remove both `'9's` because `file2.txt` also has a `'9`? — markp-fuso, Mar 31 '21 at 12:46
@Arturo ok, keep the dual 9's from `file3.txt` ... but then would you remove the single 9 from `file2.txt`? — markp-fuso, Mar 31 '21 at 13:06
@markp-fuso. Yes, I would remove the single 9 from file2.txt. Thanks for your help in detailing my question. — Arturo, Mar 31 '21 at 13:17
you may want to update the question with those details re: dual 9's in `file3.txt`, especially since the accepted answer no longer works if dual 9's are added to `file3.txt` — markp-fuso, Mar 31 '21 at 13:19
@Arturo, IMHO but if you are keeping 9 from file3 and removing from file2 then it looks like opposite of current question's requirement :) Please clarify on this once, thank you. — RavinderSingh13, Mar 31 '21 at 13:20
ok, two more questions ... dual `9's` in both `file2.txt` and `file3.txt` ... output should have 2x `9's` or 4x `9's`? how does all of this extend to >2 files having singles/dupes/triples of numbers? — markp-fuso, Mar 31 '21 at 13:23
@Arturo, IMHO, may be you could open a new question to your new requirement(which is being discussed here), so that users are not confused on this one(as per your title and description here), since you have already got answers on this one, thank you. — RavinderSingh13, Mar 31 '21 at 13:23
@ markp-fuso and @ RavinderSingh13. Thanks for your comments and sorry for not being clear from the beginning. However, I'll post a new updated question. Anyway, the bottom line of my request is that any redundant line coming from different files should be removed. — Arturo, Mar 31 '21 at 13:48
@Arturo, for your quote `any redundant line coming from different files should be removed` is taken care by answers posted here, surely please post new question with full details we will be Happy to guide there cheers :) — RavinderSingh13, Mar 31 '21 at 13:51

RavinderSingh13 · Accepted Answer · 2021-03-31T12:57:57.380

With your shown samples, could you please try the following. It does NOT remove redundant lines within a file, but it does remove any line that already appeared in an earlier file.

awk '
FNR==1{
  for(key in current){
    total[key]
  }
  delete current
}
!($0 in total)
{
  current[$0]
}
' file1.txt file2.txt  file3.txt

Explanation: a detailed, line-by-line explanation of the above.

awk '                                ##Start of the awk program.
FNR==1{                              ##On the first line of each input file, do the following.
  for(key in current){               ##Loop over the indexes of the current array.
    total[key]                       ##Copy each index of current into total (lines seen in all previous files).
  }
  delete current                     ##Empty the current array for the new file.
}
!($0 in total)                       ##Pattern with no action: print the line if it is NOT present in total.
{                                    ##Separate rule with no pattern, so it runs for every line.
  current[$0]                        ##Record the current line in the current array.
}
' file1.txt file2.txt  file3.txt     ##Input file names.
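
To see the behaviour on concrete input, here is a minimal sketch that wraps the same awk logic in a shell function and runs it on two made-up files. The function name merge_across_files and the sample data are assumptions for illustration only; they are not the asker's files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption): prints every line of the
# given files except lines that already appeared in an EARLIER file.
# Repeats inside a single file are kept.
merge_across_files() {
  awk '
    FNR == 1 {                          # starting a new input file
      for (key in current) total[key]   # remember lines of the previous file
      delete current
    }
    !($0 in total)                      # print lines not seen in earlier files
    { current[$0] }                     # record every line of the current file
  ' "$@"
}

# Made-up sample data, for illustration only.
demo_dir=$(mktemp -d)
printf '%s\n' 1 2 3 3 4 > "$demo_dir/a.txt"
printf '%s\n' 3 4 5 5   > "$demo_dir/b.txt"

merge_across_files "$demo_dir/a.txt" "$demo_dir/b.txt"
rm -rf "$demo_dir"
```

On this input it prints 1 2 3 3 4 5 5, one per line: the repeated 3 in a.txt and the repeated 5 in b.txt survive, while the 3 and 4 in b.txt are dropped because a.txt already contained them.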

Great answer Ravinder Singh. I just started learning a new language. It's fun. — Amit Verma, Mar 31 '21 at 20:09

score 6 · Answer 2 · answered Mar 31 '21 at 12:56

Here's a trick building on https://stackoverflow.com/a/15385080/3358272 using diff and its group-format options. It likely presumes sorted input; I have not tested that.

out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in "$@" ; do
    { cat "${out}" ;
      diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
    } > "${tmpout}"
    mv "${tmpout}" "${out}"
done
cat "${out}"

Output:

$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11

$ diff <(./question.sh F*) Output.txt

(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)
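
The two format options do the real work in the loop above: with GNU diff, `--unchanged-group-format=''` suppresses the lines common to both inputs, and `--changed-group-format='%>'` prints only the second file's lines from each differing group. A minimal sketch on made-up data follows; the function name new_lines_only is an assumption for illustration, and the sketch relies on GNU diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption): prints the lines of "$2"
# that diff reports as added or changed relative to "$1".
# Relies on the GNU diff group-format options.
new_lines_only() {
  # diff exits with a non-zero status when the files differ (or on error),
  # so the status is masked with "|| true".
  diff --changed-group-format='%>' --unchanged-group-format='' "$1" "$2" || true
}

# Made-up sample data, for illustration only.
demo_dir=$(mktemp -d)
printf '%s\n' 1 2 3   > "$demo_dir/old.txt"
printf '%s\n' 2 3 4 4 > "$demo_dir/new.txt"

new_lines_only "$demo_dir/old.txt" "$demo_dir/new.txt"
rm -rf "$demo_dir"
```

With GNU diff this prints 4 twice: the 1 exists only in old.txt and is ignored, 2 and 3 are unchanged and suppressed, and both new 4s are kept, which is why repeats within the newly added file survive.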

Thanks for suggesting to look into this. I must admit that I didn't consider this possibility. — Arturo, Mar 31 '21 at 13:10
