comparing two files by lines and removing duplicates from first file

Question

Problem:

I need to compare two files,
remove from the first file any lines that also appear in the second,
and then append the remaining lines of file1 to file2.

Illustration by example

Suppose the two files are test1 and test2.

$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6

And test1 is

$ cat test1
www.xyz.com/abc-1
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5

Comparing test1 to test2 and removing the duplicates from test1:

Result Required:

$ cat test1
www.xyz.com/abc-1

and then appending this test1 data to test2:

$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1

Solutions Tried:

join -v1 -v2 <(sort test1) <(sort test2)

which produced this (the wrong output):

$ join -v1 -v2 <(sort test1) <(sort test2)
www.xyz.com/abc-1
www.xyz.com/abc-6

Another solution I tried was:

fgrep -vf test1 test2

which printed nothing.

Does this answer your question? [Deleting lines from one file which are in another file](https://stackoverflow.com/questions/4780203/deleting-lines-from-one-file-which-are-in-another-file) — Pound Hash, Oct 17 '22 at 23:42

score 11 · Answer 1 · answered May 28 '16 at 19:59

Remove lines from test1 because they are in test2:

$ grep -vxFf test2 test1
www.xyz.com/abc-1

To overwrite test1:

grep -vxFf test2 test1 >test1.tmp && mv test1.tmp test1

To append the new test1 to the end of test2:

cat test1 >>test2

The grep options

grep normally prints matching lines. -v tells grep to do the reverse: it prints only lines that do not match.

-x tells grep to do whole-line matches.

-F tells grep that we are using fixed strings, not regular expressions.

-f test2 tells grep to read those fixed strings, one per line, from file test2.
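The three steps above can be sketched end to end as follows. This is a minimal sketch, not a definitive script: the scratch directory and the `status` variable are assumptions added for illustration, and the sample data is copied from the question. One detail worth noting is that grep exits with status 1 when it selects no lines, so with `&&` the `mv` would be skipped if every line of test1 were a duplicate; the sketch checks the status explicitly instead.

```shell
# Work in a scratch directory so no real files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' www.xyz.com/abc-2 www.xyz.com/abc-3 www.xyz.com/abc-4 \
    www.xyz.com/abc-5 www.xyz.com/abc-6 > test2
printf '%s\n' www.xyz.com/abc-1 www.xyz.com/abc-2 www.xyz.com/abc-3 \
    www.xyz.com/abc-4 www.xyz.com/abc-5 > test1

# Keep only the lines of test1 that do not appear in test2.
grep -vxFf test2 test1 > test1.tmp
status=$?

# Status 0: some lines kept. Status 1: no lines kept (all duplicates).
# Anything greater than 1 is a real grep error.
if [ "$status" -le 1 ]; then
    mv test1.tmp test1
    cat test1 >> test2
else
    rm -f test1.tmp
    echo "grep failed with status $status" >&2
fi
```

With the sample data, test1 ends up holding only the abc-1 line and that line is appended to the end of test2.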

`$ grep -vxFf test2 test1` this is resulting nothing. No output. — Ankit Jain, May 29 '16 at 04:55

score 8 · Accepted Answer · answered May 28 '16 at 20:30

With awk:

% awk 'NR == FNR{ a[$0] = 1;next } !a[$0]' test2 test1
www.xyz.com/abc-1

Breakdown:

NR == FNR { # Run for test2 only
  a[$0] = 1 # Store whole line as key in associative array
  next      # Skip next block
}
!a[$0]      # Print lines from test1 that are not in a
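A hedged variant of the same idea, shown here because `NR == FNR` is also true for the second file when the first file is empty. This sketch tests `FILENAME == ARGV[1]` instead and uses `$0 in seen` so that lookups do not create array entries. The scratch directory and the array name `seen` are assumptions for illustration, not part of the answer above.

```shell
# Work in a scratch directory so no real files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' www.xyz.com/abc-2 www.xyz.com/abc-3 www.xyz.com/abc-4 \
    www.xyz.com/abc-5 www.xyz.com/abc-6 > test2
printf '%s\n' www.xyz.com/abc-1 www.xyz.com/abc-2 www.xyz.com/abc-3 \
    www.xyz.com/abc-4 www.xyz.com/abc-5 > test1

# First file (test2): remember every line.
# Second file (test1): print only lines that were not remembered.
awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }
     !($0 in seen)' test2 test1 > test1.tmp && mv test1.tmp test1

# Append what is left of test1 to test2.
cat test1 >> test2
```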

score 2 · Answer 3 · answered May 28 '16 at 21:04

Solution to steps 1 and 2 (compare the files and remove the duplicates from test1):

diff test1 test2 | grep '^<' | sed 's/< \+//g' > test1.tmp && mv test1.tmp test1

Here is the output:

$ cat test1
www.xyz.com/abc-1

Solution to step 3 (append test1 to test2):

cat test1 >> test2

Here is the output:

$ cat test2
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6
www.xyz.com/abc-1
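Because diff reports line edits rather than a set difference, its output depends on the order of lines in both files. As a hedged alternative, which swaps in `comm` and is not the method of this answer, the lines unique to test1 can be taken from sorted copies. The scratch directory and the `.sorted` file names are assumptions for illustration; note that test1 ends up in sorted order.

```shell
# Work in a scratch directory so no real files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' www.xyz.com/abc-2 www.xyz.com/abc-3 www.xyz.com/abc-4 \
    www.xyz.com/abc-5 www.xyz.com/abc-6 > test2
printf '%s\n' www.xyz.com/abc-1 www.xyz.com/abc-2 www.xyz.com/abc-3 \
    www.xyz.com/abc-4 www.xyz.com/abc-5 > test1

# comm needs sorted input; sort into separate files so test2 keeps its order.
sort test1 > test1.sorted
sort test2 > test2.sorted

# -2 drops lines unique to test2, -3 drops lines common to both,
# leaving only the lines unique to test1.
comm -23 test1.sorted test2.sorted > test1.tmp && mv test1.tmp test1

# Append what is left of test1 to test2.
cat test1 >> test2
```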

answered May 28 '16 at 21:04

sumitya


`$ cat test1` output is `< www.xyz.com/abc-1 `. Why is the `<` there? – Ankit Jain May 29 '16 at 04:57
I have tested this in bash; which shell are you using? `sed 's/< \+//g'` already handles it. Please make sure to keep the files in the stated order in the `diff` command. – sumitya May 29 '16 at 05:51

score 0 · Answer 4 · answered May 28 '16 at 22:57

If the lines in each file are unique, as shown in your sample input, and sorted output is acceptable (you are already sorting the input files in your attempted solutions), then this is all you need:

$ sort -u test1 test2
www.xyz.com/abc-1
www.xyz.com/abc-2
www.xyz.com/abc-3
www.xyz.com/abc-4
www.xyz.com/abc-5
www.xyz.com/abc-6

If you need something else then edit your question to clarify your requirements and provide sample input/output that would cause this to break.
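If the original line order matters, one hedged alternative (not part of this answer) is the awk idiom `!seen[$0]++`, which prints each distinct line the first time it appears. The scratch directory, the array name `seen`, and the output file `merged` are assumptions for illustration.

```shell
# Work in a scratch directory so no real files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' www.xyz.com/abc-2 www.xyz.com/abc-3 www.xyz.com/abc-4 \
    www.xyz.com/abc-5 www.xyz.com/abc-6 > test2
printf '%s\n' www.xyz.com/abc-1 www.xyz.com/abc-2 www.xyz.com/abc-3 \
    www.xyz.com/abc-4 www.xyz.com/abc-5 > test1

# Read test2 first, then test1. seen[$0]++ is 0 (false) on the first
# occurrence of a line, so the negation prints it exactly once.
awk '!seen[$0]++' test2 test1 > merged

cat merged
```

With the sample data, `merged` holds the five lines of test2 in their original order followed by the abc-1 line, which matches the test2 content requested in the question.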

answered May 28 '16 at 22:57

Ed Morton


I guess you didn't read the question properly. I want to remove the duplicates from the test1 file and then append the result to the test2 file. – Ankit Jain May 29 '16 at 04:52
I read it perfectly but many times people ask for A when they actually want B and your question sounds like you are describing what you think are the steps required to solve a problem, not the problem itself. Why do you care where the lines from each file end up as long as the result is the unique set of lines from both files? – Ed Morton May 29 '16 at 13:13
