What is the fastest way to compare two text files, not counting moved lines as different

Question

I have two files which are very large in size say 50000 lines each. I need to compare these two files and identify the changes. However, the catch is if a line is present at different position, it should not be shown as different.

For eg, consider this
File A.txt

xxxxx
yyyyy
zzzzz

File B.txt

zzzzz
xxxx
yyyyy

So if this is the content of the file. My code should give the output as xxxx(or both xxxx and xxxxx).

Ofcourse the easiest way would be storing each line of the file in a

List< String>

and comparing with the other

List< String>.

But this seems to be taking a lot of time. I have also tried using the DiffUtils in java. But it doesnt recognize the lines present in diferent line numbers as same. So is there any other algorithm that might help me?

Maybe you can use simple arrays (string[]). This will be much faster. Or if you want to use an finished implementation, you can use FileUtils.contentEquals(file1, file2); from org.apache.commons.io.FileUtils. — marc3l, Sep 14 '15 at 12:58
If you are looking for fastest way then call diff api of linux from java and you are done — prashant thakre, Sep 14 '15 at 13:02
Can the same line appear more than once in the files? If so, and the same line occurs once in one file and twice in the other, are those files the same? — DJClayworth, Sep 14 '15 at 13:12
@DJClayworth In that case too it should be assumed as different. Thanks for bringing up this valuable scenario. — saru10, Sep 14 '15 at 13:21
@saru10 if you are deploying in Linux box then Fastest way to accomplish this tasks is by using then call diff api of linux from java and you are done — prashant thakre, Sep 14 '15 at 14:34

score 2 · Answer 1 · edited May 23 '17 at 11:51

2

In general HashSet would be the best solution, but as we are dealing with strings there are two possible solutions:

saving one file as HashSet and trying to find the lines of other file in it.
saving one file as Trie and trying to find the lines of other file in it

In this post you can find comparison between HashSets and Tries How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

edited May 23 '17 at 11:51

Community

1
1

answered Sep 14 '15 at 13:00

pointer

307
4
12

2

What would be the keys and values of the table? Why use a Hashtable and not a HashSet? – JB Nizet Sep 14 '15 at 13:00
1

Oh, yes JB Nizet you are right HashSet should be enough. – pointer Sep 14 '15 at 13:03
1

Also doesn't consider the possibility that there are additional lies in the other file, or consider duplicate lines. – DJClayworth Sep 14 '15 at 13:08
@DJClayworth why would you need to consider duplicate lines? – nafas Sep 14 '15 at 13:10
See comments on question. – DJClayworth Sep 14 '15 at 13:22

nafas · Accepted Answer · 2015-09-14T13:09:11.550

1

probably using Set is the easiest way:

Set<String> set1 = new HashSet<String>(FileUtils.readLines(file1));

Set<String> set2 = new HashSet<String>(FileUtils.readLines(file2));


Set<String> similars = new HashSet<String>(set1);

similars.retainAll(set2);

set1.removeAll(similars); //now set1 contains distinct lines in file1
set2.removeAll(similars); //now set2 contains distinct lines in file2
System.out.println(set1); //prints distinct lines in file1;
System.out.println(set2); //prints distinct lines in file2

edited Sep 14 '15 at 13:09

answered Sep 14 '15 at 13:02

nafas

5,283
3
29
57

@DJClayworth we are comparing two files and trying to find distinct lines between them, I don't see why duplicate would be an issue. – nafas Sep 14 '15 at 13:10
@downvoter, could you please collaborate the reason for the downvote? – nafas Sep 14 '15 at 13:20
Is this going to be faster that comparing two lists? – saru10 Sep 14 '15 at 13:27
Also in this line set2.removeAll(similars); //now set2 contains distinct lines in file2. How does the set2 contain distinct lines? similars contains all the entries in set2 though anyway. – saru10 Sep 14 '15 at 13:29
I already did explain the downvote. In the case where the same line occurs twice in the first file, and once in the second, this will not flag the extra occurrence of that line. – DJClayworth Sep 14 '15 at 13:29
@saru10 similars will only contains those that both set1 and set2 contains. (retainAll only keeps the similar, its basically intersection of two sets) – nafas Sep 15 '15 at 11:38
@saru10 note: say if you have String s repeated twice in a file1, and no s in file2, it will only return s one time, not twice – nafas Sep 15 '15 at 12:00
@nafas In that case once is enough. :) – saru10 Sep 15 '15 at 12:36
@nafas. This might help. But still it is going to consume lot of memory and time right? – saru10 Sep 15 '15 at 12:37
1

@saru10 50000 is nothing for a `Set` mate, i do this for file size of millions. the process for 50000 should finish in few secs – nafas Sep 15 '15 at 13:01

score 1 · Answer 3 · answered Sep 14 '15 at 14:00

You need to keep track of the case where the same record might appear more than once in the files. For example, if a record appears twice in file A and once in file B, then you need to record that as an extra record.

Since we have to keep track of the number of occurrences, you need one of:

A Multiset
A Map from record to Integer e.g. Map

With a Multiset, you can add and remove records and it will keep track of the number of times the record has been added (a Set doesn't do that - it rejects an add of a record that is already there). With the Map approach, you have to do a little bit of work so that the integer tracks the number of occurrences. let's consider that approach (the MultiSet is simpler).

With the map, when we talk about 'adding' a record, you look to see if there is an entry for that String in the Map. if there is, replace the value with value+1 for that key. If there isn't, create an entry with the value of 1. When we talk about 'removing an entry', look for an entry for that key. If you find it, replace the value with value-1. If that reduces the value to 0, remove the entry.

Create a Map for each file.
Read a record for one of the files
Check to see if that record exists in the other Map.
If it exists in the other Map, remove that entry (see above for what that means)
If it doesn't exist, add it to the Map for this file (see above)
Repeat until end, alternating files.

The contents of the two Maps will give you the records that appeared in that file but not the other.

Doing this as we go along, rather than building the Maps up front, keeps the memory usage down, but probably doesn't have a big impact on performance.

score 0 · Answer 4 · answered May 27 '19 at 05:28

I think this will be useful,

   BufferedReader reader1 = new BufferedReader(new FileReader("C:\\file1.txt"));

    BufferedReader reader2 = new BufferedReader(new FileReader("C:\\file2.txt"));

    String line1 = reader1.readLine();

    String line2 = reader2.readLine();

    boolean areEqual = true;

    int lineNum = 1;

    while (line1 != null || line2 != null)
    {
        if(line1 == null || line2 == null)
        {
            areEqual = false;

            break;
        }
        else if(! line1.equalsIgnoreCase(line2))
        {
            areEqual = false;

            break;
        }

        line1 = reader1.readLine();

        line2 = reader2.readLine();

        lineNum++;
    }

    if(areEqual)
    {
        System.out.println("Two files have same content.");
    }
    else
    {
        System.out.println("Two files have different content. They differ at line "+lineNum);

        System.out.println("File1 has "+line1+" and File2 has "+line2+" at line "+lineNum);
    }

    reader1.close();

    reader2.close();

score -1 · Answer 5 · answered Sep 14 '15 at 13:01

-1

You could try parsing your first file first, storing all of the lines in a HashMap and then checking whether there is a mapping present for each of the lines of the second file.

This is still O(n), though.

answered Sep 14 '15 at 13:01

Pandatyr

284
2
8

3

Doesn't check whether there are lines in the first file that aren't in the second. Doesn't check if the same line appears the same number of times in both files. – DJClayworth Sep 14 '15 at 13:03
It's possible to create an implementation that checks for those using a HashMap (or, as others pointed out, a HashSet, which would likely already be sufficient). It requires some more effort, but it's likely still better than creating two lists and comparing each one of the entries. – Pandatyr Sep 14 '15 at 13:13

score -1 · Answer 6 · answered Sep 14 '15 at 13:02

-1

Just use a byte comparison with BufferedReader. This will be the fastest way to compare two files. Read a byte block from one file and compare it with the byte block of the other file. First check if the file length is the same.

Or just use FileUtils.contentEquals(file1, file2); from org.apache.commons.io.FileUtils.

answered Sep 14 '15 at 13:02

marc3l

2,525
7
34
62

1

Doesn't allow for the same line in different positions. – DJClayworth Sep 14 '15 at 13:04

score -1 · Answer 7 · answered Sep 14 '15 at 13:03

-1

You can use FileUtils.contentEquals(file1, file2)

It will compare the contents of the 2 files.

Find more information here

answered Sep 14 '15 at 13:03

vbh

17
3

1

Doesn't allow for having the same lines in different order. – DJClayworth Sep 14 '15 at 13:04

What is the fastest way to compare two text files, not counting moved lines as different

7 Answers7

Linked