There are two files called "a.txt" and "b.txt" both have a list of words. Now I want to check which words are extra in "a.txt" and are not in "b.txt".
I need a efficient algorithm as I need to compare two dictionaries.
There are two files called "a.txt" and "b.txt" both have a list of words. Now I want to check which words are extra in "a.txt" and are not in "b.txt".
I need a efficient algorithm as I need to compare two dictionaries.
if you have vim installed,try this:
vimdiff file1 file2
or
vim -d file1 file2
you will find it fantastic.
Sort them and use comm
:
comm -23 <(sort a.txt) <(sort b.txt)
comm
compares (sorted) input files and by default outputs three columns: lines that are unique to a, lines that are unique to b, and lines that are present in both. By specifying -1
, -2
and/or -3
you can suppress the corresponding output. Therefore comm -23 a b
lists only the entries that are unique to a. I use the <(...)
syntax to sort the files on the fly, if they are already sorted you don't need this.
If you prefer the diff output style from git diff
, you can use it with the --no-index
flag to compare files not in a git repository:
git diff --no-index a.txt b.txt
Using a couple of files with around 200k file name strings in each, I benchmarked (with the built-in time
command) this approach vs some of the other answers here:
git diff --no-index a.txt b.txt
# ~1.2s
comm -23 <(sort a.txt) <(sort b.txt)
# ~0.2s
diff a.txt b.txt
# ~2.6s
sdiff a.txt b.txt
# ~2.7s
vimdiff a.txt b.txt
# ~3.2s
comm
seems to be the fastest by far, while git diff --no-index
appears to be the fastest approach for diff-style output.
Update 2018-03-25 You can actually omit the --no-index
flag unless you are inside a git repository and want to compare untracked files within that repository. From the man pages:
This form is to compare the given two paths on the filesystem. You can omit the --no-index option when running the command in a working tree controlled by Git and at least one of the paths points outside the working tree, or when running the command outside a working tree controlled by Git.
You can use diff
tool in linux to compare two files. You can use --changed-group-format and --unchanged-group-format options to filter required data.
Following three options can use to select the relevant group for each option:
'%<' get lines from FILE1
'%>' get lines from FILE2
'' (empty string) for removing lines from both files.
E.g: diff --changed-group-format="%<" --unchanged-group-format="" file1.txt file2.txt
[root@vmoracle11 tmp]# cat file1.txt
test one
test two
test three
test four
test eight
[root@vmoracle11 tmp]# cat file2.txt
test one
test three
test nine
[root@vmoracle11 tmp]# diff --changed-group-format='%<' --unchanged-group-format='' file1.txt file2.txt
test two
test four
test eight
You can also use: colordiff: Displays the output of diff with colors.
About vimdiff: It allows you to compare files via SSH, for example :
vimdiff /var/log/secure scp://192.168.1.25/var/log/secure
Extracted from: http://www.sysadmit.com/2016/05/linux-diferencias-entre-dos-archivos.html
Also, do not forget about mcdiff - Internal diff viewer of GNU Midnight Commander.
For example:
mcdiff file1 file2
Enjoy!
Use comm -13
(requires sorted files):
$ cat file1
one
two
three
$ cat file2
one
two
three
four
$ comm -13 <(sort file1) <(sort file2)
four
You can also use:
sdiff file1 file2
To display differences side by side within your terminal!
diff a.txt b.txt | grep '<'
can then pipe to cut for a clean output
diff a.txt b.txt | grep '<' | cut -c 3
Here is my solution for this :
mkdir temp
mkdir results
cp /usr/share/dict/american-english ~/temp/american-english-dictionary
cp /usr/share/dict/british-english ~/temp/british-english-dictionary
cat ~/temp/american-english-dictionary | wc -l > ~/results/count-american-english-dictionary
cat ~/temp/british-english-dictionary | wc -l > ~/results/count-british-english-dictionary
grep -Fxf ~/temp/american-english-dictionary ~/temp/british-english-dictionary > ~/results/common-english
grep -Fxvf ~/results/common-english ~/temp/american-english-dictionary > ~/results/unique-american-english
grep -Fxvf ~/results/common-english ~/temp/british-english-dictionary > ~/results/unique-british-english
Using awk for it. Test files:
$ cat a.txt
one
two
three
four
four
$ cat b.txt
three
two
one
The awk:
$ awk '
NR==FNR { # process b.txt or the first file
seen[$0] # hash words to hash seen
next # next word in b.txt
} # process a.txt or all files after the first
!($0 in seen)' b.txt a.txt # if word is not hashed to seen, output it
Duplicates are outputed:
four
four
To avoid duplicates, add each newly met word in a.txt to seen
hash:
$ awk '
NR==FNR {
seen[$0]
next
}
!($0 in seen) { # if word is not hashed to seen
seen[$0] # hash unseen a.txt words to seen to avoid duplicates
print # and output it
}' b.txt a.txt
Output:
four
If the word lists are comma-separated, like:
$ cat a.txt
four,four,three,three,two,one
five,six
$ cat b.txt
one,two,three
you have to do a couple of extra laps (for
loops):
awk -F, ' # comma-separated input
NR==FNR {
for(i=1;i<=NF;i++) # loop all comma-separated fields
seen[$i]
next
}
{
for(i=1;i<=NF;i++)
if(!($i in seen)) {
seen[$i] # this time we buffer output (below):
buffer=buffer (buffer==""?"":",") $i
}
if(buffer!="") { # output unempty buffers after each record in a.txt
print buffer
buffer=""
}
}' b.txt a.txt
Output this time:
four
five,six