Here's a fairly efficient solution. I think it's roughly O(n), though the exact cost depends on the distribution of adds and deletes. Memory consumption is low, but it grows with the longest run of consecutive added and deleted lines, since those are buffered until the algorithm resynchronizes.
Limitations:
- This algorithm doesn't keep the patch lines in the same order they appeared in the original file. If that's critical, you could track added and removed lines with a Dictionary<string,int> (key: the line; value: its original line number) instead of a HashSet<string>, then flush in line-number order; see the sketch after this list.
- The target ("new") file must be somewhat similar to the source ("old") file: specifically, all unchanged lines must appear in the same order in both files. If this condition isn't met, the algorithm will behave badly.
- Each line must be unique among the lines near it, where "near" means between the nearest pair of lines that are unchanged between source and target. If this condition isn't met, the algorithm will miss changes.
- This implementation doesn't account for modified lines. I think you could add that functionality by replacing the == comparisons with whatever operation you use to decide that two lines are the "same" line, then writing a line out to the patch when it is the "same" line but its content has changed.
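As a rough illustration of the first point, here's what a Dictionary-based flush might look like. OrderedFlush is a hypothetical replacement for the Flush method further down, and it assumes the main loop records each line's original line number when buffering it:

// Hypothetical order-preserving variant: buffer lines with their original
// line numbers, then emit them in file order instead of hash order.
public void OrderedFlush(StreamWriter writer, Dictionary<string, int> lines, string suffix)
{
    foreach (var pair in lines.OrderBy(p => p.Value))
    {
        writer.WriteLine("{0} {1}", pair.Key.Trim(), suffix);
    }
}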
The algorithm uses a pair of buffers, "added" and "removed", to track potentially added and removed lines as it runs through the files. Lines are tentatively marked "added" or "removed" when they don't match between the files. When a tentatively marked line later turns up in one of the files (a "removed" line found in the target file, or an "added" line found in the source file), that's a signal that all the lines in the other buffer really do belong there. So the other buffer is flushed to the patch file, and the reader is advanced one line in the file where the matched line was found.
For example:
Source   Target    Added     Removed
A--------A         _         _
B---\    X         +X        +B
C-\  \---B         Flush X   -B
D  \-----C         _         _
E--------E         +E        +D
F--------F         -E        Flush D
Here's the code:
// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
public void Diff(
    string sourcePath,
    string targetPath,
    string patchPath,
    string addedSuffix,
    string removedSuffix)
{
    using (var sourceReader = new StreamReader(sourcePath))
    using (var targetReader = new StreamReader(targetPath))
    using (var patchWriter = new StreamWriter(patchPath, append: false))
    {
        var sourceLine = sourceReader.ReadLine();
        var targetLine = targetReader.ReadLine();
        var added = new HashSet<string>();
        var removed = new HashSet<string>();
        do
        {
            if (sourceLine == targetLine)
            {
                sourceLine = sourceReader.ReadLine();
                targetLine = targetReader.ReadLine();
            }
            else
            {
                if (removed.Contains(targetLine))
                {
                    // Found targetLine in tentatively removed lines, so it wasn't actually removed.
                    removed.Remove(targetLine);
                    // Since we found something we thought had been removed, we know that all
                    // tentatively added lines actually are new.
                    Flush(patchWriter, added, addedSuffix);
                    added.Clear();
                    targetLine = targetReader.ReadLine();
                }
                else if (added.Contains(sourceLine))
                {
                    // Found sourceLine in tentatively added lines, so it wasn't actually added.
                    added.Remove(sourceLine);
                    // We found something we thought had been added, so all candidates for
                    // removal should actually be removed.
                    Flush(patchWriter, removed, removedSuffix);
                    removed.Clear();
                    sourceLine = sourceReader.ReadLine();
                }
                else
                {
                    // Source and target don't match, so we assume the source line was removed
                    // and the target line was added. If we're wrong, we'll clean it up when we
                    // come across the line later on.
                    removed.Add(sourceLine);
                    added.Add(targetLine);
                    sourceLine = sourceReader.ReadLine();
                    targetLine = targetReader.ReadLine();
                }
            }
        } while (sourceLine != null || targetLine != null);
        // Whatever is still buffered at the end really was added or removed.
        Flush(patchWriter, added, addedSuffix);
        Flush(patchWriter, removed, removedSuffix);
    }
}

public void Flush(StreamWriter writer, IEnumerable<string> lines, string suffix)
{
    // Null entries can appear when one file ends before the other; skip them.
    foreach (var line in lines.Where(l => l != null))
    {
        writer.WriteLine("{0} {1}", line.Trim(), suffix);
    }
}
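Calling it is straightforward; here's a minimal sketch (the paths are placeholders, and the "1"/"0" suffixes match the markers the test generator below writes into expected.txt):

// Hypothetical paths; any text files will do.
Diff(
    sourcePath: "source.txt",
    targetPath: "target.txt",
    patchPath: "patch.txt",
    addedSuffix: "1",     // marker appended to added lines
    removedSuffix: "0");  // marker appended to removed lines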
Here's some code that I used to generate test files:
var path = /* path */;
var sourcePath = Path.Combine(path, "source.txt");
var targetPath = Path.Combine(path, "target.txt");
var expectedPath = Path.Combine(path, "expected.txt");
var rnd = new Random(10); // fixed seed so runs are repeatable

using (var sourceWriter = new StreamWriter(sourcePath))
using (var targetWriter = new StreamWriter(targetPath))
using (var expectedWriter = new StreamWriter(expectedPath))
{
    var limit = 10.0 * 100000;
    for (int i = 0; i < limit; i++)
    {
        if (i % 10000 == 0) Console.Write("{0:P0} ...", i / limit);
        var guid = Guid.NewGuid().ToString();
        var r = rnd.Next(0, 10);
        var removed = 3;
        var added = 6;
        if (r >= 0 && r < removed)
        {
            // ~30%: line exists only in the source, so it should show up as removed ("0").
            sourceWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 0");
        }
        else if (r >= removed && r < added)
        {
            // ~30%: line exists only in the target, so it should show up as added ("1").
            targetWriter.WriteLine(guid);
            expectedWriter.WriteLine(guid + " 1");
        }
        else if (r >= added)
        {
            // ~40%: line is unchanged and shouldn't appear in the patch at all.
            sourceWriter.WriteLine(guid);
            targetWriter.WriteLine(guid);
        }
    }
}
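One caveat when checking the output: because the patch doesn't preserve order (first limitation above), patch.txt and expected.txt can't be compared line by line. A quick order-insensitive check, assuming patchPath points at the file Diff wrote:

// Sort both files' lines, then compare them as sequences.
var patchLines = File.ReadAllLines(patchPath).OrderBy(l => l);
var expectedLines = File.ReadAllLines(expectedPath).OrderBy(l => l);
Console.WriteLine(patchLines.SequenceEqual(expectedLines) ? "PASS" : "FAIL");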
See any mistakes or issues? Is this what you're looking for?