
I work for a company that does ETL work on various databases. I am tasked with creating a patch between two full historical data sets on the client machine, which would then be sent over to our servers. This patch needs to be programmatic so that it can be called from our software.

The datasets are simple text files. We have extraction software running on our client's systems to perform the extraction. Extraction files range in size up to 3 GB+. I have implemented a solution using Microsoft's FC.exe, but it has limitations.

I'm using FC to produce the comparison file, then parsing it in Perl on our side to extract the records that have been removed/updated and those that have been added.

FC works perfectly fine for me as long as the line of text does not exceed 128 characters. When a line is longer than that, the remainder is wrapped onto the next line of the comparison file, and so appears as an added/deleted record. I know that I could probably pre-process the files, but that would add a tremendous amount of time, probably defeating the purpose.

I have tried using diffutils, but it complains about large files.

I also toyed with some C# code to implement the patch process myself. This worked fine for small files, but was horribly inefficient when dealing with the big ones (tested on a 2.8 GB extract).

Are there any good command-line utilities or C# libraries that I can use to create this patch file? Barring that, is there an algorithm that I can use to implement this myself? Keep in mind that records may be updated, added, and deleted. (I know, it irks me too that the clients DELETE records rather than marking them inactive. That is out of my control.)

Edit for clarity:

I need to compare two separate database extracts from two different times. Usually these will be about one day apart.

Given the files below (these will obviously be much longer and much wider):


Old.txt

a
b
c
d
e
1
f
2
5

New.txt

a
3
b
c
4
d
e
1
f
g

The expected output would be:

3 added
4 added
2 removed
g added
5 removed
rbedger
  • You might try the GNUWin32 version of diff.exe. Never used it on a file that large, but it might work. For a C# solution, have you seen this: http://stackoverflow.com/questions/1271225/c-sharp-reading-a-file-line-by-line – Brian.D.Myers Jul 18 '13 at 23:58
  • diffutils, as I mentioned trying, is part of GNUWin32. It does not work at all for files this large. That link is pretty basic; I already understand how to read files in C#. The problem is comparing two files which can be different at any point (as opposed to only being appended to, in which case this would be easy). – rbedger Jul 19 '13 at 10:30
  • After some preliminary tests, it seems that what you are after is pretty doable in C#; if you explain the exact conditions better, I can try to come up with simplified code to prove the point. For example: what are, roughly speaking, the actions performed during the patch process (+ maximum time expectations)? – varocarbas Jul 22 '13 at 16:13
  • Essentially a 'patch' needs to contain the entire row for any case: update, add, or delete. For update/delete cases, a '0' should be appended to the row (this indicates removal). For update/add cases, a '1' should be appended to the row (this indicates addition). Maximum processing time should not exceed one hour for files ~4GB. Unfortunately I can't provide you with sample data as it contains confidential health information. – rbedger Jul 22 '13 at 16:51
  • Instead of pre-processing your files, maybe you could try to post-process the FC.exe output to fix it. It shouldn't be too hard: for any line that's 128 chars long in the output, look it up in the input to find out whether you should merge it with the next line. It should be reasonably efficient as long as you don't have too many 128+ char lines that are different. – jods Jul 22 '13 at 17:40
  • @jods - wouldn't that require a full scan of the input files to locate the line? – rbedger Jul 22 '13 at 19:19
  • I have written an answer with the results of my tests. Let me know what you think. – varocarbas Jul 22 '13 at 19:39
  • @rbedger: you're right. I was thinking "access by line number" but this is not something that's easily done. You could optimize by doing a single pass (diffs are sorted by increasing line numbers), but that's still one additional pass over your data :/ – jods Jul 23 '13 at 13:08
  • Your output confuses me: how do you know that 2 has been removed before the line "g added"? Isn't it possible that the 2 is just right after g in New.txt? – D.R. Jul 23 '13 at 18:26
  • In my example 'Old.txt' and 'New.txt' are complete files. So you know that 2 has been removed because it doesn't exist in 'New.txt' at all. – rbedger Jul 23 '13 at 18:59
  • If the problem is (as your examples show) just to detect added and removed records, the problem is easy. However, if you need to detect modified records, it is quite different. Do you need to detect modified records? – lontivero Jul 24 '13 at 01:11
  • Why does this have the sql tag? – Gordon Linoff Jul 24 '13 at 01:20
  • @lontivero - Yes, I need to detect modified records as well. – rbedger Jul 24 '13 at 11:33
  • Does the order of the lines in the files matter? Can the extraction process be tweaked to output the records in a certain order? Also, how would you differentiate a new record from a modified record? Is there a part of the record that can be used as a key to establish the identity of the record? – Olaf Jul 24 '13 at 14:28

2 Answers


Here's a pretty efficient solution - I think it's roughly O(n), but it depends on the distribution of adds and deletes. Memory consumption is pretty low, but also depends on the number of consecutive adds and deletes.

Limitations:

  1. This algorithm doesn't keep the patch lines in the same order they appeared in the original file. If that's critical, you could use a Dictionary<string,int>, where the key is the line and the value is the original line number, instead of a HashSet<string>, to track added and removed lines (see the sketch after the code below).
  2. The target ("new") file must be somewhat similar to the source ("old") file. Specifically, all unchanged lines must be in the same order in the source and the target. If this condition isn't met the algorithm will behave badly.
  3. Each line must be unique with regard to lines near it, where "near" means between the nearest lines which are unchanged between source and target. If this condition isn't met, the algorithm will miss changes.
  4. This implementation doesn't account for modified lines. I think you could add that functionality by replacing == comparisons with whatever operation you use for detecting that two lines are the "same" line, then writing them out to the patch if they are the "same" line with content changes.

The algorithm uses a pair of "added" and "removed" buffers to track potentially added and removed lines as it runs through the files. Lines are tentatively marked "added" or "removed" when they don't match between the files. When a tentatively marked line is later found in one of the files (a "removed" line found in the target file, or an "added" line found in the source file), that is a signal that all the lines in the other buffer really do belong there, so the other buffer is flushed to the patch file and the reader is advanced one line in the file where the match was found.

For example:

 
Source  Target Added   Removed
A-------A      _       _
B-------X      +X      +B
C-------B      Flush X -B
D--\  \-C      _       _
E-\ \---E      +E      +D
F  \----F      -E      Flush D

Here's the code:

public void Diff(
    string sourcePath,
    string targetPath,
    string patchPath,
    string addedSuffix,
    string removedSuffix)
{
    using(var sourceReader = new StreamReader(sourcePath))
    using(var targetReader = new StreamReader(targetPath))
    using(var patchWriter = new StreamWriter(patchPath, append:false))
    {   
        var sourceLine = sourceReader.ReadLine();
        var targetLine = targetReader.ReadLine();

        var added = new HashSet<string>();
        var removed = new HashSet<string>();

        do{
            if(sourceLine == targetLine)
            {   
                sourceLine = sourceReader.ReadLine();
                targetLine = targetReader.ReadLine();
            }
            else
            {
                if(removed.Contains(targetLine))
                {
                    // Found targetLine in tentatively removed lines, so it wasn't actually removed.
                    removed.Remove(targetLine);
                    // Since we found something we thought had been removed, we know that all tentatively added lines actually are new.
                    Flush(patchWriter, added, addedSuffix);             
                    added.Clear();

                    targetLine = targetReader.ReadLine();               
                } 
                else if(added.Contains(sourceLine))
                {
                    // Found sourceLine in tentatively added lines, so it wasn't actually added.
                    added.Remove(sourceLine);
                    // We found something we thought had been added, so all candidates for removal should actually be removed.
                    Flush(patchWriter,removed, removedSuffix);
                    removed.Clear();

                    sourceLine = sourceReader.ReadLine();               
                }
                else
                {
                    // Source and target don't match, so we assume that the source was removed and the target was added.
                    // If we're wrong, we'll clean it up when we come across the line later on.
                    removed.Add(sourceLine);
                    added.Add(targetLine);
                    sourceLine = sourceReader.ReadLine();               
                    targetLine = targetReader.ReadLine();               
                }       
            }   
        } while(sourceLine != null || targetLine != null); 

        Flush(patchWriter, added, addedSuffix);
        Flush(patchWriter, removed, removedSuffix);
    }
}

public void Flush(StreamWriter writer, IEnumerable<string> lines, string suffix)
{
    foreach (var line in lines.Where (l => l != null))
    {
        writer.WriteLine("{0} {1}", line.Trim(), suffix);
    }
}
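
A minimal sketch of the Dictionary<string,int> variant from limitation 1, assuming you also keep a running line counter for each reader. This is illustrative only; FlushOrdered is a hypothetical replacement for Flush and is not part of the tested code above.

public void FlushOrdered(StreamWriter writer, IDictionary<string, int> lines, string suffix)
{
    // Write the buffered lines back out in their original order.
    // Note: unlike HashSet<string>, Dictionary<string, int> rejects null keys,
    // so the Add calls in Diff would need a null guard once a reader runs out of lines.
    foreach (var pair in lines.OrderBy(p => p.Value))
    {
        writer.WriteLine("{0} {1}", pair.Key.Trim(), suffix);
    }
}

In Diff you would then create the buffers as new Dictionary<string, int>(), use ContainsKey/Remove in place of Contains/Remove, and pass the current line number to Add.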

Here's some code that I used to generate test files:

var path = /* path */;
var sourcePath = Path.Combine(path, "source.txt");
var targetPath = Path.Combine(path, "target.txt");
var expectedPath = Path.Combine(path, "expected.txt");
var rnd = new Random(10);

using(var sourceWriter = new StreamWriter(sourcePath))
using(var targetWriter = new StreamWriter(targetPath))
using(var expectedWriter = new StreamWriter(expectedPath))
{
    var limit = 10.0 * 100000;
    for (int i = 0; i < limit; i++)
    {
        if(i % 10000 == 0) Console.Write("{0:P0} ...", i / limit);
        var guid = Guid.NewGuid().ToString();
        var r = rnd.Next(0,10);
        var removed = 3;
        var added = 6;
        if(r >= 0 && r < removed)
        {
            sourceWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 0");
        }
        else if(r >= removed && r < added)
        {
            targetWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 1");
        }
        else if(r >= added)
        {   
            sourceWriter.WriteLine(guid);
            targetWriter.WriteLine(guid);           
        }
    }
}
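
For reference, a call over those generated files might look like the following. The patch file name is arbitrary, and the suffix values just follow the '1' added / '0' removed convention from the comments; the patch line order will differ from expected.txt, per limitation 1.

Diff(sourcePath, targetPath, Path.Combine(path, "patch.txt"), addedSuffix: "1", removedSuffix: "0");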

See any mistakes or issues? Is this what you're looking for?

Steve Ruble
  • That's why I asked about modified records: the real problem is detecting them (added and removed are a piece of cake). Just a comment, using hashes (or hash-based techniques like hashtables/dictionaries) is not okay, because they will have collisions in big files with lots of lines. – lontivero Jul 24 '13 at 13:32
  • lontivero, the hash sets never get larger than the longest string of consecutive added/removed lines, so that shouldn't be a problem unless there are huge blocks of changes (and even so, System.String probably has a pretty balanced hash algo). I've updated my answer to note the update detection requirement, which I didn't see before I posted the answer. – Steve Ruble Jul 24 '13 at 14:25
  • I wrote a very similar algorithm to yours, but using a dictionary of string hash/string pairs, and it always failed when tested with 4 GB files (as proof, I posted it on my Spanish blog: http://geeks.ms/blogs/lontivero/archive/2013/07/23/string-gethashcode-colisiones.aspx). That's why I say that even though System.String.GetHashCode is very well balanced, it is an Int32 value (at most 2^32 hashes), so you have a high probability of getting a collision. This comment is just to share my warning, simply that. – lontivero Jul 24 '13 at 14:39
  • @lontivero, the MSDN guidance for GetHashCode specifically warns not to use the value returned from GetHashCode as a key in a keyed collection. GetHashCode is for internal use by hash-using types like HashSet and Dictionary, not for key generation. http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx – Steve Ruble Jul 24 '13 at 17:02
  • This works perfectly for my purposes. Thanks for your help, and thanks for providing well formatted source code! – rbedger Jul 24 '13 at 18:41

Well, you are comparing 2 text files, each of which has entries that are not necessarily in any order, and I'm expecting an entry to have a certain format to it. If I understand this right, what you really have would be this: * represents the start of an entry and @ represents the end of an entry, so OLD.TXT would be *a@*b@*c@ etc.

A crude "algorithm" would be:

  1. Make a copy of NEW, call it ADDED.
  2. Get an entry from OLD.
  3. Scan ADDED for that entry. If it's there, save the entry in a file called STILLEXISTS and delete that entry from the ADDED file; if the entry is not in ADDED, save it in a file called DELETED. Then get the next entry from OLD.
  4. When this ends, you'll have 3 files: the entries that were added, the entries that were deleted, and a bonus "still there" file, all in one pass ;)

Hope I understood it right, and this can help you.
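
For illustration only, a minimal C# sketch of that idea, assuming one entry per line and that NEW fits in memory as the working copy. The ADDED, STILLEXISTS, and DELETED names come from the steps above; everything else is an assumption.

// Rough sketch of the approach described above: walk OLD once against a
// working copy of NEW, bucketing each entry into STILLEXISTS or DELETED;
// whatever is left in the working copy at the end was added.
public void CrudeDiff(string oldPath, string newPath, string outputDir)
{
    var workingCopy = new List<string>(File.ReadAllLines(newPath)); // copy of NEW

    using (var stillExists = new StreamWriter(Path.Combine(outputDir, "STILLEXISTS.txt")))
    using (var deleted = new StreamWriter(Path.Combine(outputDir, "DELETED.txt")))
    {
        foreach (var entry in File.ReadLines(oldPath))
        {
            // Scan the working copy for the entry; Remove returns true if it was found.
            if (workingCopy.Remove(entry))
                stillExists.WriteLine(entry);   // present in both OLD and NEW
            else
                deleted.WriteLine(entry);       // only in OLD
        }
    }

    // Entries never matched against OLD are the added ones.
    File.WriteAllLines(Path.Combine(outputDir, "ADDED.txt"), workingCopy);
}

As the comment below points out, the inner scan of the working copy makes this roughly O(old × new) in the worst case, so it will be slow on multi-GB extracts.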

  • Using your proposed method, you'd end up scanning the file numerous times, since you have no idea how much change there is on either side. It would probably work, but would be very inefficient. – rbedger Jul 24 '13 at 11:35
  • Yes, it's not very elegant, but it will work :) Also, the second file gets smaller with every pass, making the scan linearly faster with each new entry. If you have too many of these huge files, and Steve Ruble's method doesn't suit your data, you could have two processes working, one from the beginning and another from the end, cutting some time. Also, if you expect the changed data to be less than what wasn't changed, you could swap the scanning operation to saving only new records. That would also accelerate the process. A programmer would definitely be able to improve the basic idea :) –  Jul 24 '13 at 13:39