Java efficiency in comparing large checksum lists

Question

I'm trying to compare two directories having about 15k files each for any changes. A is the newer version and B must be updated to it.

I have two large checksum list files, call them A and B. A is newer and B is an older version. Each have about 15k entries that look sort of like the following :

<entry1 -filepath> <entry1 -checksum>
<entry2 -filepath> <entry2 -checksum>
<entry3 -filepath> <entry3 -checksum>
.                  .
.                  .
.                  .

The entries are listed in alphabetical order. The two files need to be compared to check the following :

1. Two entries have the same file path but different checksums.
2. An entry exists in file A but not in file B.
3. An entry exists in file B but not in file A.

My proposal algorithm :

int currentBLine = -1;

for(int index = 0; index < A.length; index++)
{
    String newfilepath = A[index].getFilePath();
    String newchecksum = A[index].getCheckSum();

    for(; currentBLine < B.length; currentBLine++)
    {
        String oldfilepath = B[currentBLine].getFilePath();
        String oldchecksum = B[currentBLine].getCheckSum();

        if(filepath.compareTo(oldfilepath) > 0)
        {
            deleteFile(oldfilepath);
        }
        else if(filepath.compareTo(oldfilepath) == 0)
        {
            if(checksum.equals(oldchecksum)
            {
                currentBLine++;
                break;
            }
            else
            {
                updateFile(oldfilepath, newfilepath);
                break;
            }
        }
        else
        {
            createFile(newfilepath);
            break;
        }
    }
}

Is this the most efficient way of doing this? Am I doing something wrong here?

If anyone sees an XY problem, just let me know and I will fill in the background.

How much time does this code take to execute ? If its not too much, then I would not worry about efficiency. If too much or if number of files increases daily, then you should consider looking for a way to monitor the folder A for any changes. Then, only copy the changes or delta to folder B. — Erran Morad, Oct 05 '14 at 00:54
Yes, that will be done. But I am preparing for the case where the computer has just booted up and is is polling a server for changes to the directory. — Hele, Oct 05 '14 at 00:55
Just wondering. Can you make folder-a read only for a certain amount of time and then make it rw ? — Erran Morad, Oct 05 '14 at 01:00
Yes. But highly not recommended as folder A is being actively used by another program. Every few minutes, this synchronization routine will be performed. So if the folder write permissions were removed, the other program (Eclipse Luna!!!) would be affected. — Hele, Oct 05 '14 at 01:04
One more thing, what if a new file is added to folderA and your loop skips that file ? Is that acceptable or will you take care of that new file when you run your loop code again ? — Erran Morad, Oct 05 '14 at 01:07
If the new loop is already nearing the end, and a file is added just then, the new file that is added is detected using a directory monitor and is synchronized at the end of the loop. Basically, all changes that occur after the start of the sync loop are added to a list and are executed at the end of the loop. — Hele, Oct 05 '14 at 01:10

score 1 · Accepted Answer · edited May 23 '17 at 11:58

1

The program that you have (two nested loops with the break in the inner loop) implements the standard algorithm for processing two sorted collections together. It is similar to the one that you use when merging two sorted lists: make two indexes, one for each list, and loop until both lists reach the end.

You can bring it to its classic form by making it a single loop instead of using two nested loops. At each step of the loop you perform the comparison similar to what you've got in the three-way if statement that you have. The only difference is that you wouldn't use a break, and you would need to check the indexes into A and B to be within their limits. If both indexes are within A and B limits, compare the files and check sums the way that you have coded. If you reached the end of A, delete the B file. If you reached the end of B, copy the A file. The loop ends once you have exhausted both lists.

edited May 23 '17 at 11:58

Community

1
1

answered Oct 05 '14 at 01:22

Sergey Kalinichenko

714,442
84
1,110
1,523

1

But wouldn't that be pretty similar to my code? I do have 2 nested loops, but the inner loop runs only about 2-3 entries each time. – Hele Oct 05 '14 at 01:31
@Hele Ah, I see what you are doing (I missed the `break` on the first reading). Does your program work correctly, though? Because when you break out of that inner loop, you do not increment `currentBLine`, yet `A`'s `index` gets incremented. This means that you would never keep the `B` file. – Sergey Kalinichenko Oct 05 '14 at 01:37
I have'nt really run the code, it's just pseudo code. But yeah, you're right. It should be incremented before breaking. I will edit my question. – Hele Oct 05 '14 at 01:42
@Hele Try running your code then. The overall algorithm is as efficient as it gets, because it goes through both lists only once. – Sergey Kalinichenko Oct 05 '14 at 01:49
Just one more thing, If I werent to use the innner loop, how would i know if there were multiple simultaneous entries in b which arent in a? – Hele Oct 05 '14 at 10:42
@Hele I think you would end up using `break` there as well, the way you did in the edit. The only place where you would continue with the loop is when `B` is "catching up" to `A`. Another problem with the nested loop approach is that once you run out of entries in `A`, your `B` loop would stop as well (because it is nested), so you would end up with another loop after the nested ones to process the remaining entries of `B`. I would go with a single loop, though, or with the other classic solution of three sequential loop (on A&B's overlap, then on remaining items of `A`, then on `B`). – Sergey Kalinichenko Oct 05 '14 at 10:47
Yeah, I've almost completed the implementation. There are a number of changes to be made including that .compareTo in Java is lexicographic and we need it to be alphabetic. I will make the changes soon. – Hele Oct 05 '14 at 11:07
Also, I dont understand how you could do it without an inner loop as you said. If there are say, 20 simultaneous entries in B which are new files which dont even exist in A and need to be deleted. What would happenm in this case? – Hele Oct 05 '14 at 11:08
@Hele Let's say `A` has files p and t, while `B` has p, q, r, s, and v. The loop will process p, and increment both indexes. Then it would see t in `A` and q in `B`. Since t is after q, the code would delete q, and advance only `B`'s index. It would do the same on the next two iterations, when `B` has r and s, both coming before t. After that it's t and v. Now `A`s entry is before `B`'s, so the file is copied, and only `A`'s index is moved forward. – Sergey Kalinichenko Oct 05 '14 at 11:21
Oh yeah, that makes sense. Just in my case, that wouldnt work case I'm using a Java while(Scanner.hasNextLine()) so i cant loop without incrementing A. But thanks :D – Hele Oct 05 '14 at 11:27

Java efficiency in comparing large checksum lists

1 Answers1