18

From the Unix terminal, we can use `diff file1 file2` to find the difference between two files. Is there a similar command to show the lines that two files have in common? (Many pipes allowed, if necessary.)

Each line of each file contains a sentence; the files have been sorted, with duplicate lines removed, using `sort file1 | uniq`.
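For reference, the same sort-and-dedupe preprocessing can be done in one step per file with `sort -u`; the output file names here are just examples:

$ sort -u file1 > file1.sorted
$ sort -u file2 > file2.sorted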

file1: http://pastebin.com/taRcegVn

file2: http://pastebin.com/2fXeMrHQ

The output should contain the lines that appear in both files.

output: http://pastebin.com/FnjXFshs

I am able to do it in Python like this, but I think it's a little too much to type into the terminal:

x = set(i.strip() for i in open('wn-rb.dic'))   # unique lines of the first file
y = set(i.strip() for i in open('wn-s.dic'))    # unique lines of the second file
z = x.intersection(y)                           # lines common to both files
outfile = open('reverse-diff.out', 'w')         # must be opened for writing
for i in z:
    print>>outfile, i
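For reference, the same set intersection can be squeezed into a single terminal command; this is a sketch assuming Python 2, with the file names taken from the snippet above:

$ python -c "print '\n'.join(set(l.strip() for l in open('wn-rb.dic')) & set(l.strip() for l in open('wn-s.dic')))" > reverse-diff.out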
alvas
  • What do your files look like? – paulmelnikow Mar 18 '13 at 05:26
  • 4
    possible duplicate of [how to show lines in common (reverse diff)?](http://stackoverflow.com/questions/746458/how-to-show-lines-in-common-reverse-diff) – beatgammit Mar 18 '13 at 05:26
  • Most of the time, a string of human-language sentences. Sometimes columnized with more information too. – alvas Mar 18 '13 at 05:37
  • Perhaps you could give an example of two simple files and the sort of output you'd like to get from that input? It's not clear to me exactly what you're trying to achieve. It would also be helpful to understand a bit more the motivation for doing this, as someone may have a different approach to solve your problem. – Martin Atkins Mar 18 '13 at 05:38
  • And you want the output to be a list of the lines that both files have in common? – Martin Atkins Mar 18 '13 at 05:52
  • @MartinAtkins, updated the question with the desired output. – alvas Mar 18 '13 at 05:56
  • Can the same line appear more than once in a given file? Is the order of the lines in the files relevant? – Jonathan Leffler Mar 18 '13 at 06:15
  • The same line shouldn't appear more than once in a file; I think I did `sort file1 | uniq` before comparing the files. The order of the lines shouldn't be a problem either, since the sort would have put them in alphabetical order. – alvas Mar 18 '13 at 06:47

2 Answers

35

If you want a list of the lines common to both files (i.e., lines repeated in the concatenation) without resorting to AWK, you can use the `-d` flag of `uniq`:

sort file1 file2 | uniq -d
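For illustration, a minimal run against two small made-up files (a.txt and b.txt are hypothetical names):

$ printf 'apple\nbanana\ncherry\n' > a.txt
$ printf 'banana\ncherry\ndate\n' > b.txt
$ sort a.txt b.txt | uniq -d
banana
cherry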
user35147863
  • 1
    Also, dropping the -d gets you all the distinct lines from both files, which is what I was looking for. – Aaron Mar 05 '15 at 19:07
17

As @tjameson mentioned, this may already be solved in another thread. I'd just like to post another solution: `sort file1 file2 | awk 'dup[$0]++ == 1'`

  1. Refer to an awk guide for the basics: when the pattern of a line evaluates to true, that line is printed.

  2. dup[$0] is an associative array in which each key is a line of the input. Each value starts at 0 and is incremented every time that line occurs; on the second occurrence, dup[$0]++ returns 1, so dup[$0]++ == 1 is true and the line is printed.

Note that this only works when there are no duplicates within either file, as was specified in the question.
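A quick sketch of that caveat, with made-up file names: a line duplicated within a single file is also reported, even though it appears in only one of the files:

$ printf 'x\nx\n' > a.txt
$ printf 'y\n' > b.txt
$ sort a.txt b.txt | awk 'dup[$0]++ == 1'
x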

michaelgulak
  • Can you explain how `awk 'dup[$0]++ == 1'` works? Your solution is much better than the confusing `comm` – alvas Mar 18 '13 at 05:52
  • 1
    `awk` uses `pattern { action }` notation. Since this is not in braces, it is a pattern. `$0` is the current line. `dup[$0]` is an associative array indexed by the lines; when first created, the value is 0; `dup[$0]++` post-increments the value, so it returns 0 the first time, and 1 on the second time, etc. When its value is 1, the condition is true so the default action (print the line) is executed. – Jonathan Leffler Mar 18 '13 at 06:14
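    A minimal illustration of the pattern-only behavior described above; the input is made up:

    $ printf '1\n2\n3\n' | awk '$0 > 1'
    2
    3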