2

I need to merge several files, removing redundant lines among files, while keeping redundant lines within files. A schematic representation of my files is the following:

File1.txt

1
2
3
3
4
5
6

File2.txt

6
7
8
8
9

File3.txt

9
10
10
11

The desired output would be:

1
2
3
3
4
5
6
7
8
8
9
10
10
11

I would prefer a solution in awk, bash, or R. I searched the web for solutions and, though there were plenty of them (please find some examples below), they all removed duplicated lines regardless of whether the duplicates were within a file or across files.

Thanks in advance. Arturo

Arturo
  • `I searched the web for solutions`: please add them to your question (to avoid close votes and downvotes), as on SO it is highly encouraged for original posters to show their efforts in the form of code in their questions (not my downvote, by the way). Thank you. – RavinderSingh13 Mar 31 '21 at 12:38
  • Thank you. I added a few examples of previous solutions that removed redundant lines from both within and outside files. – Arturo Mar 31 '21 at 12:44
  • What would you expect if `file3.txt` == `9 9 10 10 12`? Do you keep the dual 9s, or remove both 9s because `file2.txt` also has a 9? – markp-fuso Mar 31 '21 at 12:46
  • @markp-fuso. Thanks. I would keep the dual 9s. – Arturo Mar 31 '21 at 12:50
  • @Arturo ok, keep the dual 9's from `file3.txt` ... but then would you remove the single 9 from `file2.txt`? – markp-fuso Mar 31 '21 at 13:06
  • @markp-fuso. Yes, I would remove the single 9 from file2.txt. Thanks for your help in detailing my question. – Arturo Mar 31 '21 at 13:17
  • you may want to update the question with those details re: dual 9's in `file3.txt`, especially since the accepted answer no longer works if dual 9's are added to `file3.txt` – markp-fuso Mar 31 '21 at 13:19
  • @Arturo, IMHO, if you are keeping the 9s from file3 and removing the 9 from file2, that looks like the opposite of the current question's requirement :) Please clarify this, thank you. – RavinderSingh13 Mar 31 '21 at 13:20
  • ok, two more questions ... dual `9's` in both `file2.txt` and `file3.txt` ... output should have 2x `9's` or 4x `9's`? how does all of this extend to >2 files having singles/dupes/triples of numbers? – markp-fuso Mar 31 '21 at 13:23
  • @Arturo, IMHO, maybe you could open a new question for your new requirement (which is being discussed here), so that users are not confused by this one (given your title and description here), since you already have answers to this one. Thank you. – RavinderSingh13 Mar 31 '21 at 13:23
  • @markp-fuso and @RavinderSingh13. Thanks for your comments, and sorry for not being clear from the beginning. I'll post a new, updated question. Anyway, the bottom line of my request is that any redundant line coming from different files should be removed. – Arturo Mar 31 '21 at 13:48
  • @Arturo, your requirement that `any redundant line coming from different files should be removed` is taken care of by the answers posted here. Please do post a new question with full details; we will be happy to guide you there. Cheers :) – RavinderSingh13 Mar 31 '21 at 13:51

2 Answers

6

With your shown samples, could you please try the following. It keeps duplicate lines within a file, but drops any line that has already appeared in an earlier file.

awk '
FNR==1{
  for(key in current){
    total[key]
  }
  delete current
}
!($0 in total)
{
  current[$0]
}
' file1.txt file2.txt  file3.txt

Explanation: a detailed, commented version of the above.

awk '                                ##Starting awk program from here.
FNR==1{                              ##Checking condition if its first line(of each file) then do following.
  for(key in current){               ##Traverse through current array here.
    total[key]                       ##placing index of current array into total(for all files) one.
  }
  delete current                     ##Deleting current array here.
}
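!($0 in total)                       ##If current line is NOT present in total then print it (a pattern with no action prints the line).
{                                    ##This block starts on a new line, so it is a separate rule that runs for every line.
  current[$0]                        ##Place current line into current array.
}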
' file1.txt file2.txt  file3.txt     ##Mentioning Input_file names here.
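
As a cross-check, the same "a line belongs to the first file that contains it" rule can also be written with a single array. This is only a sketch of an equivalent variant, not the program above: the names `nfile` and `first` and the scratch directory are illustrative assumptions, and the sample files are recreated from the question so the script runs on its own.

```shell
#!/usr/bin/env bash
# Sketch: rebuild the question's sample files in a scratch directory
# (directory and variable names are illustrative, not from the answer).
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf '%s\n' 1 2 3 3 4 5 6 > "$workdir/File1.txt"
printf '%s\n' 6 7 8 8 9     > "$workdir/File2.txt"
printf '%s\n' 9 10 10 11    > "$workdir/File3.txt"

# nfile counts the input files; first[line] records the file in which a line
# first appeared. A line is printed only while awk is still reading that file,
# so repeats inside the file survive and repeats in later files are dropped.
result=$(awk '
  FNR == 1       { nfile++ }
  !($0 in first) { first[$0] = nfile }
  first[$0] == nfile
' "$workdir/File1.txt" "$workdir/File2.txt" "$workdir/File3.txt")

printf '%s\n' "$result"
```

On the sample files this should print the desired output from the question (1 2 3 3 4 5 6 7 8 8 9 10 10 11, one value per line).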
RavinderSingh13
6

Here's a trick building on https://stackoverflow.com/a/15385080/3358272 that uses diff and its output-format options. It likely presumes the input files are sorted; I have not tested unsorted input.

out=$(mktemp -p .)
tmpout=$(mktemp -p .)
trap 'rm -f "${out}" "${tmpout}"' EXIT
for F in "$@" ; do
    { cat "${out}" ;
      diff --changed-group-format='%>' --unchanged-group-format='' "${out}" "${F}" ;
    } > "${tmpout}"
    mv "${tmpout}" "${out}"
done
cat "${out}"

Output:

$ ./question.sh F*
1
2
3
3
4
5
6
7
8
8
9
10
10
11

$ diff <(./question.sh F*) Output.txt

(Per markp-fuso's comment, if File3.txt had two 9s, this would preserve both.)
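
The dual-9 case from the comments can be tried out with a small experiment. The sketch below uses awk rather than the diff script, and compares two rules on markp-fuso's variant of File3.txt: the "first file wins" program from the other answer, and a hypothetical "last file wins" rule that reads the file list twice. The second rule is only one possible reading of the clarified requirement, which the thread never settled; the names `n`, `nfile`, `last`, `first_wins`, and `last_wins` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare two de-duplication rules on the dual-9 variant of File3.txt.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

printf '%s\n' 1 2 3 3 4 5 6 > File1.txt
printf '%s\n' 6 7 8 8 9     > File2.txt
printf '%s\n' 9 9 10 10 12  > File3.txt   # variant proposed in the comments

# Rule A, "first file wins": a line is kept only in the first file containing it.
first_wins=$(awk '
  FNR == 1 { for (k in cur) seen[k]; delete cur }
  !($0 in seen)
  { cur[$0] }
' File1.txt File2.txt File3.txt)

# Rule B, "last file wins" (hypothetical reading of the clarified request):
# pass 1 records the last file that contains each line, pass 2 prints a line
# only while reading that file. n must equal the number of distinct files.
last_wins=$(awk -v n=3 '
  FNR == 1   { nfile++ }
  nfile <= n { last[$0] = nfile; next }
  last[$0] == nfile - n
' File1.txt File2.txt File3.txt File1.txt File2.txt File3.txt)

printf 'first file wins: %s\n' "$(printf '%s' "$first_wins" | tr '\n' ' ')"
printf 'last file wins:  %s\n' "$(printf '%s' "$last_wins"  | tr '\n' ' ')"
```

Under rule A the output contains a single 9 (from File2.txt); under rule B it contains the two 9s from File3.txt and drops the one from File2.txt, which matches what Arturo described in the comments. Neither rule answers the follow-up about dual 9s in both files.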

r2evans
  • Thanks for suggesting to look into this. I must admit that I didn't consider this possibility. – Arturo Mar 31 '21 at 13:10