I am using diff to find the differences between two text files. It worked well, but when I change the order of the lines in the files, lines that exist in both files show up in the result file.

Here is file1.txt:

>gi17
AAAAAA
>gi30
BBBBBB
>gi40
CCCCCC
>gi92
DDDDDD
>gi50
EEEEEE
>gi81
FFFFFF

Here is file2.txt:

>gi40
CCCCCC
>gi01
BBBBBB
>gi02
AAAAAA
>gi30
BBBBBB

Result.txt:

>gi17
AAAAAA
>gi30        ???
BBBBBB       ???
>gi92
DDDDDD
>gi01
BBBBBB
>gi50
EEEEEE
>gi81
FFFFFF
>gi02
AAAAAA
>gi30        ???
BBBBBB       ???

Diff statement:

$ diff C:/Users/User/Desktop/File1.txt C:/Users/User/Desktop/File2.txt > C:/Users/User/Desktop/Result.txt

Why does it display

>gi30
BBBBBB

as a difference?

Edit 1: What I want is to search for each line of file 1 anywhere in file 2, because the two files are not in the same order and I cannot modify them (genetic data).

Edit 2: I want to execute the join command from my PHP code. It runs successfully in the Cygwin terminal, but it does not run from PHP:

shell_exec("C:\\cygwin64\\bin\\bash.exe --login -c 'join -v 1 <(sort $OldDatabaseFile.txt) <(sort $NewDatabaseFile.txt) > $text_files_path/DelSeqGi.txt 2>&1'");

Thanks.

sara
  • `diff` also takes the order of the lines into account. Try it with two simple files, each containing the numbers 1 to 5 but in a different order. The diff will report them as different. – fedorqui Apr 19 '16 at 07:20
  • @fedorqui OMG! Is there a way to ignore the order and search for each line in the whole file? – sara Apr 19 '16 at 07:27
  • @sara sort the file beforehand. – 123 Apr 19 '16 at 07:52
  • Did you google for [`fasta diff`](https://www.google.com/search?q=fasta+diff) before asking? – tripleee Apr 19 '16 at 11:03
  • @tripleee No. Is it for genetic data? – sara Apr 19 '16 at 11:05
  • That could very easily be googled, too. Yes, FASTA is a very common bioinformatics file format. Your samples look exactly like FASTA. – tripleee Apr 19 '16 at 11:06
  • @tripleee I have two database versions, and I want to extract just the new sequences (which are in the version 2 text file) and the deleted sequences (which are in the version 1 text file) for further processing. Is fasta diff helpful? This is the first time I have seen it. – sara Apr 19 '16 at 11:09
  • For diffing two FASTA files, that's what you would look for, yes. I don't link to any particular implementation, but you want a tool which understands the FASTA format; there are many to choose from. – tripleee Apr 19 '16 at 11:19
  • @tripleee I found something called fadiff; it does exactly what I want. Where can I run it from? I am using Windows. – sara Apr 19 '16 at 11:19
  • Stack Overflow is not a software recommendation site. If you need help using software you downloaded from the Internet, try https://superuser.com/ or the download site's support department if they have one. – tripleee Apr 19 '16 at 11:21
  • @tripleee No, it is not software. It is a command line: 'fadiff [OPTIONS] ' – sara Apr 19 '16 at 11:30
  • You are probably confused about what you found. This looks exactly like a tool you would have to download and install locally. – tripleee Apr 19 '16 at 12:08
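
The point made in the comments about needing a FASTA-aware comparison can be illustrated with standard tools. The following is only a rough sketch, not a replacement for a dedicated FASTA tool: it assumes every record is exactly two lines (a header line followed by a single sequence line), which real FASTA files often are not. The file names, the temporary directory and the variable names are made up for the illustration. `paste - -` joins each header with its sequence so that the pair is compared as one unit.

```shell
# Throwaway directory so no real data is touched (hypothetical file names).
tmp=$(mktemp -d)

# Small excerpts in the style of the question's data.
printf '%s\n' '>gi17' AAAAAA '>gi30' BBBBBB '>gi40' CCCCCC > "$tmp/old.fa"
printf '%s\n' '>gi40' CCCCCC '>gi01' BBBBBB '>gi30' BBBBBB > "$tmp/new.fa"

# ASSUMPTION: each record is exactly two lines (header, then sequence).
# "paste - -" merges every pair of consecutive lines into one
# tab-separated line, so header and sequence stay together when sorting.
paste - - < "$tmp/old.fa" | LC_ALL=C sort > "$tmp/old.rec"
paste - - < "$tmp/new.fa" | LC_ALL=C sort > "$tmp/new.rec"

# Records only in the old file (deleted) and only in the new file (added).
deleted=$(LC_ALL=C comm -23 "$tmp/old.rec" "$tmp/new.rec")
added=$(LC_ALL=C comm -13 "$tmp/old.rec" "$tmp/new.rec")

echo "deleted records:"; echo "$deleted"
echo "added records:";   echo "$added"

rm -rf "$tmp"
```

With this record-based view, `>gi01` is reported together with its sequence as a new record, even though the sequence line `BBBBBB` on its own also occurs in the old file.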

2 Answers

As fedorqui said in the comments, diff compares files line by line, so the order of the lines matters.
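
This behaviour can be checked with a small experiment along the lines fedorqui suggested. The sketch below uses a temporary directory and made-up file names (`a.txt`, `b.txt`), and sorts into temporary files so that it does not depend on bash-only syntax.

```shell
# Throwaway directory so no real data is touched (hypothetical file names).
tmp=$(mktemp -d)

# The same five lines in two different orders.
printf '%s\n' 1 2 3 4 5 > "$tmp/a.txt"
printf '%s\n' 5 4 3 2 1 > "$tmp/b.txt"

# diff exits with status 1 when the files differ; "|| true" keeps the
# script going if it is run with "set -e".
raw=$(diff "$tmp/a.txt" "$tmp/b.txt" || true)

# Sort copies of both files; the originals are left unchanged.
sort "$tmp/a.txt" > "$tmp/a.sorted"
sort "$tmp/b.txt" > "$tmp/b.sorted"
sorted=$(diff "$tmp/a.sorted" "$tmp/b.sorted" || true)

if [ -n "$raw" ]; then echo "unsorted files: diff reports differences"; fi
if [ -z "$sorted" ]; then echo "sorted files: diff reports no differences"; fi

rm -rf "$tmp"
```

Both files contain the same set of lines, yet only the sorted copies compare as identical.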

To achieve what you want, you can do:

comm -3 <(sort f1.txt) <(sort f2.txt) > result.txt

Manual (relevant part):

comm - compare two sorted files line by line

       -1     suppress column 1 (lines unique to FILE1)

       -2     suppress column 2 (lines unique to FILE2)

       -3     suppress column 3 (lines that appear in both files)


EXAMPLES
  comm -3 file1 file2
    Print lines in file1 not in file2, and vice versa.
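
As a sketch of how the column options combine: `-23` keeps only the lines unique to the first file and `-13` only those unique to the second, which gives two separate lists instead of the two-column output of `-3`. `comm` expects both inputs to be sorted in the same collation order it uses itself, so the example forces `LC_ALL=C` for both `sort` and `comm`. That a locale mismatch is behind a "not in sorted order" message is an assumption here, not something confirmed in this thread. The file names and variable names are made up for the illustration.

```shell
# Throwaway directory so no real data is touched (hypothetical file names).
tmp=$(mktemp -d)

printf '%s\n' '>gi17' AAAAAA '>gi30' BBBBBB '>gi40' CCCCCC > "$tmp/f1.txt"
printf '%s\n' '>gi40' CCCCCC '>gi01' BBBBBB > "$tmp/f2.txt"

# Sort copies of the files; the originals are left unchanged.
# LC_ALL=C makes sort and comm agree on the collation order.
LC_ALL=C sort "$tmp/f1.txt" > "$tmp/f1.sorted"
LC_ALL=C sort "$tmp/f2.txt" > "$tmp/f2.sorted"

# -23 suppresses columns 2 and 3: lines unique to f1 remain.
only_in_f1=$(LC_ALL=C comm -23 "$tmp/f1.sorted" "$tmp/f2.sorted")
# -13 suppresses columns 1 and 3: lines unique to f2 remain.
only_in_f2=$(LC_ALL=C comm -13 "$tmp/f1.sorted" "$tmp/f2.sorted")

echo "only in f1:"; echo "$only_in_f1"
echo "only in f2:"; echo "$only_in_f2"

rm -rf "$tmp"
```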
rdupz
  • I tried it. It gives me "comm: file 2 is not in sorted order" and "comm: file 1 is not in sorted order", and it prints the lines common to the two files! I want the differences, not the common lines. – sara Apr 19 '16 at 07:32

To get the differences between the files, use the join utility as shown below:

DESCRIPTION
     The join utility performs an ``equality join'' on the specified files and
     writes the result to the standard output.  The ``join field'' is the
     field in each file by which the files are compared.  The first field in
     each line is used by default.  There is one line in the output for each
     pair of lines in file1 and file2 which have identical join fields.  Each
     output line consists of the join field, the remaining fields from file1
     and then the remaining fields from file2.

 -v file_number
         Do not display the default output, but display a line for each
         unpairable line in file file_number.  The options -v 1 and -v 2
         may be specified at the same time.

 -1 field
         Join on the field'th field of file1.

 -2 field
         Join on the field'th field of file2.

join -v 1 <(sort file1.txt) <(sort file2.txt)     # Lines in file1.txt that file2.txt does not have
join -v 2 <(sort file1.txt) <(sort file2.txt)     # The reverse: lines in file2.txt that file1.txt does not have
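
As a rough check, the two commands can be run against the sample data from the question. This sketch sorts into temporary files instead of using `<(...)`, so it also works in shells without process substitution, and it sets `LC_ALL=C` so that `sort` and `join` use the same ordering. The temporary directory and the variable names `removed` and `added` are made up for the illustration.

```shell
# Throwaway directory so no real data is touched.
tmp=$(mktemp -d)

# file1.txt and file2.txt as given in the question.
printf '%s\n' '>gi17' AAAAAA '>gi30' BBBBBB '>gi40' CCCCCC \
              '>gi92' DDDDDD '>gi50' EEEEEE '>gi81' FFFFFF > "$tmp/file1.txt"
printf '%s\n' '>gi40' CCCCCC '>gi01' BBBBBB '>gi02' AAAAAA \
              '>gi30' BBBBBB > "$tmp/file2.txt"

# Sort copies of the files; the originals are left unchanged.
LC_ALL=C sort "$tmp/file1.txt" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2.txt" > "$tmp/file2.sorted"

# Each line has a single field, so the whole line is the join field.
removed=$(LC_ALL=C join -v 1 "$tmp/file1.sorted" "$tmp/file2.sorted")
added=$(LC_ALL=C join -v 2 "$tmp/file1.sorted" "$tmp/file2.sorted")

echo "only in file1:"; echo "$removed"
echo "only in file2:"; echo "$added"

rm -rf "$tmp"
```

Neither list contains `>gi30` or `BBBBBB`, the lines the question marks with `???`, because both occur in each file. Note that the comparison is still per line: `>gi01` shows up as new, but its sequence does not, since `BBBBBB` also exists in file1.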

Original answer/credits: https://stackoverflow.com/a/4544780/5291015

Inian
  • It is a great tool and it improves my work. Before, I wrote extra code after running diff: a loop that compares the diff result with file 1 and puts the matches in another file, then another loop that compares the diff result with file 2 and puts the matches in another file. join does exactly what I want in just one line. But I have some comments about the output. First I want to do some tests, then I will come back here. – sara Apr 19 '16 at 08:08
  • You should probably start a new post for it with the steps you followed; answering it here would be outside the scope of the current question. – Inian Apr 19 '16 at 10:53