I'm trying to find the fastest way to compare two text files line by line, combining the results in a single multidimensional array with flags for the differences, e.g.:

array(0) (line1) (in file a) (in file b)
array(1) (line2) (in file a) (not in file b)
array(2) (line3) (not in file a) (in file b)

I know how to write my own code for this, but it's slow, so I wonder whether there is a built-in .NET method that's faster. I'm currently working in VS2008 with .NET 3.5, but will probably move the project to VS2013/15, so whichever framework version does the job best will do.

Rado
  • How large are the datasets, i.e. how many lines are in each file? – Lasse V. Karlsen Nov 26 '14 at 14:23
  • and...it's probably best to show what you have and ask for suggestions – kenny Nov 26 '14 at 14:24
  • Do you want to know if lineX of file1 is the exact same as lineX in file2 or do you want to know if lineX of file1 is *somewhere* in file2? – Corak Nov 26 '14 at 14:25
  • Is order important? Meaning, if you first copy one file onto the other, then move the first line of one of the files to the bottom of the same file, and then compare, should you get a difference? – Lasse V. Karlsen Nov 26 '14 at 14:26
  • Up to 20000 lines for each file, perhaps more. – Rado Nov 26 '14 at 14:26
  • Order of lines in files does not matter. – Rado Nov 26 '14 at 14:27
  • Then you don't want a diff; you want a simple set-based algorithm. Try the `Except` LINQ extension method, i.e. `fileA.Except(fileB)`, to see which lines are in `fileA` but not in `fileB`, and then reverse. – Lasse V. Karlsen Nov 26 '14 at 14:28
  • What I have - well, comparing each line in file 1 with all lines in file 2, then doing the reverse, using for/next or foreach. – Rado Nov 26 '14 at 14:30
  • @Rado - Also, if you read the lines into two `HashSet` collections, the comparing would be a lot quicker than with `List` or `string[]`, because then only the strings (lines) with the same hash code will be compared, instead of all of them each time. – Corak Nov 26 '14 at 14:35
  • @Corak Thanks, looks interesting! Think I'll compare the different suggestions and see what's fastest. – Rado Nov 26 '14 at 14:47
  • Have a look at [this post!](http://stackoverflow.com/questions/24887238/how-to-compare-two-rich-text-box-contents-and-highlight-the-characters-that-are/24970638#24970638) – TaW Nov 26 '14 at 15:06
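
The set-based approach suggested in the comments above (order does not matter, `HashSet` lookups instead of nested loops) could look roughly like the following. This is only a sketch under assumptions: the `LineFlags` and `LineComparer` names are made up for illustration, lines are compared with ordinal (case-sensitive) equality, and duplicate lines within one file collapse into a single entry.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical result type: one entry per distinct line, with presence flags.
public sealed class LineFlags
{
    public string Line { get; set; }
    public bool InA { get; set; }
    public bool InB { get; set; }
}

// Hypothetical helper, not part of the .NET Framework.
public static class LineComparer
{
    // Order-insensitive comparison of two line sequences.
    public static List<LineFlags> Compare(IEnumerable<string> linesA, IEnumerable<string> linesB)
    {
        // Assumption: ordinal comparison; swap the comparer to ignore case.
        var setA = new HashSet<string>(linesA, StringComparer.Ordinal);
        var setB = new HashSet<string>(linesB, StringComparer.Ordinal);

        var result = new List<LineFlags>();

        // Every distinct line of A, flagged by a hash lookup in B.
        foreach (string line in setA)
        {
            result.Add(new LineFlags { Line = line, InA = true, InB = setB.Contains(line) });
        }

        // Lines that occur only in B.
        foreach (string line in setB)
        {
            if (!setA.Contains(line))
            {
                result.Add(new LineFlags { Line = line, InA = false, InB = true });
            }
        }

        return result;
    }

    // Convenience overload that reads both files fully into memory.
    public static List<LineFlags> CompareFiles(string pathA, string pathB)
    {
        return Compare(File.ReadAllLines(pathA), File.ReadAllLines(pathB));
    }
}
```

A call such as `LineComparer.CompareFiles("a.txt", "b.txt")` (placeholder paths) returns one flagged entry per distinct line. Because each `Contains` call is a hash lookup rather than a scan, the work grows roughly linearly with the number of lines instead of quadratically as with comparing every line against every other line.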

1 Answer

What you need is an implementation of the classic diff algorithm in C#.

Here's one on CodeProject: http://www.codeproject.com/Articles/6943/A-Generic-Reusable-Diff-Algorithm-in-C-II

Here are a few other variations: http://www.mathertel.de/Diff/

And finally, http://devdirective.com/post/115/creating-a-reusable-though-simple-diff-implementation-in-csharp-part-3

Good luck!

Roy Dictus