
The files under folder1 and folder2 have the same names, and I want to compare those files. I am stuck on this. Is there any Java API for doing this comparison? The files may be huge.

Example:

folder1/file1
----------
kushi,metha,2
kushi,barun,1
arun,mital,3

folder2/file1
----------
arun,mital,3
kushi,metha,2
sheetal,kumar,3
kushi,barun,1

Comparing folder1/file1 with folder2/file1 should return "sheetal kumar 3". I tried googling but could not find anything useful.

rkosegi
kushi

3 Answers


I know this is not a pure Java solution, but if you have access to a *nix box:

sort file1 > sorted1; sort file2 > sorted2; comm -3 sorted1 sorted2

This would give you exactly what you need.

Then take a look at this question on how you can run shell scripts from Java.

EDIT:

What I am trying to say is that computing the diff takes two steps:

  1. Sort both the files.
  2. Compare them line by line to find the differences.
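
The two steps above can be sketched in plain Java roughly as follows. This is only a minimal in-memory illustration: the class name `SortedLineDiff` and the method `diff` are made up for this example, and for really huge files the sort would have to be done externally (as the `sort` utility does) rather than on a list held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of any library.
final class SortedLineDiff {

    /** Returns the lines that have no counterpart in the other input. */
    static List<String> diff(List<String> first, List<String> second) {
        // Step 1: sort copies, so the callers' lists are left untouched.
        List<String> a = new ArrayList<>(first);
        List<String> b = new ArrayList<>(second);
        Collections.sort(a);
        Collections.sort(b);

        // Step 2: walk both sorted lists in lockstep, like the merge step of merge sort.
        List<String> unmatched = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {          // same line on both sides
                i++;
                j++;
            } else if (cmp < 0) {    // line exists only in the first input
                unmatched.add(a.get(i++));
            } else {                 // line exists only in the second input
                unmatched.add(b.get(j++));
            }
        }
        // Whatever is left over on either side has no partner.
        while (i < a.size()) unmatched.add(a.get(i++));
        while (j < b.size()) unmatched.add(b.get(j++));
        return unmatched;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("kushi,metha,2", "kushi,barun,1", "arun,mital,3");
        List<String> file2 = Arrays.asList(
                "arun,mital,3", "kushi,metha,2", "sheetal,kumar,3", "kushi,barun,1");
        System.out.println(diff(file1, file2)); // prints [sheetal,kumar,3]
    }
}
```

Unlike `comm -3`, this sketch does not say which side an unmatched line came from; keeping two result lists instead of one would add that.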
Tejas Kale

Depending on what you mean by huge, you could use a HashSet: first go through one file and add each line to the hash set, then go through the other file and remove each line you read from the hash set. Whatever remains in the set exists only in the first file. This assumes that each line is unique within a file.
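
A rough sketch of this idea, assuming unique lines per file as stated above. The names `HashSetLineDiff`, `diff` and `diffFiles` are invented for the example. It also collects the lines of the second file that were never in the set, since that is the direction the question's example needs. Only the first file is held in memory; the second one is streamed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper, not part of any library.
final class HashSetLineDiff {

    /** Lines without a partner in the other input; assumes lines are unique per input. */
    static List<String> diff(Iterable<String> first, Iterable<String> second) {
        // Pass 1: remember every line of the first input.
        Set<String> remaining = new HashSet<>();
        for (String line : first) {
            remaining.add(line);
        }

        // Pass 2: cross off each line of the second input.
        List<String> unmatched = new ArrayList<>();
        for (String line : second) {
            if (!remaining.remove(line)) {
                unmatched.add(line);      // was never in the first input
            }
        }

        // Lines still in the set occur only in the first input (order unspecified).
        unmatched.addAll(remaining);
        return unmatched;
    }

    /** Same comparison for two files, read lazily line by line. */
    static List<String> diffFiles(Path first, Path second) throws IOException {
        try (Stream<String> a = Files.lines(first); Stream<String> b = Files.lines(second)) {
            return diff(a::iterator, b::iterator);
        }
    }
}
```

For the example in the question, `HashSetLineDiff.diffFiles(Paths.get("folder1/file1"), Paths.get("folder2/file1"))` would be the call to try. If the first file does not fit in memory, this approach is not suitable and sorting first is the safer route.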

npinti
  • I thought about this one. But is there any third-party or Java API to achieve this? – kushi Mar 14 '14 at 07:14
  • @kushi: There seems to be something [here](http://code.google.com/p/java-diff-utils/); the problem is that I am not sure whether it ignores order. – npinti Mar 14 '14 at 07:22

I encountered the same problem and wrote a comparison function:

import java.util.Arrays;
import java.util.Objects;

/**
 * Compare two sequences of lines without considering order.
 * <p>
 * Input parameters will not be modified.
 */
public static <T> boolean isEqualWithoutOrder(final T[] lines1, final T[] lines2) {
    if (lines1 == null && lines2 == null) return true;
    if (lines1 == null || lines2 == null) return false;
    if (lines1.length != lines2.length) return false;

    final int length = lines1.length;
    int equalCnt = 0;

    // mask[j] == true means lines1[j] has not been matched yet
    final boolean[] mask = new boolean[length];
    Arrays.fill(mask, true);

    for (int i = 0; i < lines2.length; i++) {
        final T line2 = lines2[i];
        for (int j = 0; j < lines1.length; j++) {
            final T line1 = lines1[j];
            // java.util.Objects.equals is null-safe, like Guava's Objects.equal
            if (mask[j] && Objects.equals(line1, line2)) {
                equalCnt++;
                mask[j] = false;

                // once two equal lines are found, speculate that the following lines match too;
                // mask[j + 1] is checked so an already matched line is never counted twice
                while (j + 1 < length && i + 1 < length && mask[j + 1] &&
                        Objects.equals(lines1[j + 1], lines2[i + 1])) {
                    equalCnt++;
                    mask[j + 1] = false;
                    j++;
                    i++;
                }

                break;
            }
        }
        // lines2[0..i] are processed; fewer than i + 1 matches means one had no partner
        if (equalCnt <= i) return false;
    }
    return equalCnt == length;
}

Approaches based on the common collections may be slower. A speed comparison:

//lines1: Seq[String], lines2: Seq[String], each holding the same 100k random strings in a different order.
FastUtils.isEqualWithoutOrder(lines1.toArray, lines2.toArray) //97 ms
lines1.sorted == lines2.sorted //836 ms

Times were measured in a hot (already warmed-up) sbt environment.

(Disclaimer: I only ran some basic tests against this function.)

cuz