1

I have 2 csv files, file1.csv and file2.csv. Some lines in each file will be identical. I wish to create a 3rd csv file, based upon file2.csv but with any lines that are present in file1.csv removed from it. Effectively I wish to subtract file1.csv from file2.csv ignoring any lines present in file1 that are not in file2. I know that I could use streamreader to read each line in file2.csv and search for it in file1.csv. If it does not exist in file1.csv I can write it to file3.csv. However, the files are very large (over 30000 lines) and I believe this would take a lot of processing time. I suspect there may be a better method of loading each csv to an array and then performing a simple subtraction function on them to obtain the desired result. I would appreciate either some help with the code or on method that I should approach this problem with.

Example content of files:

file1.csv

dt97861.jpg,149954,c1714ee1,\folder1\folderA\,
dt97862.jpg,149955,c1714ee0,\folder1\folderA\,
dt97863.jpg,59368,cd23f223,\folder2\folderA\,
dt97864.jpg,57881,0835be4a,\folder2\folderB\,
dt97865.jpg,57882,0835be4b,\folder2\folderB\,

file2.csv

dt97862.jpg,149955,c1714ee0,\folder1\folderA\,
dt97863.jpg,59368,cd23f223,\folder2\folderA\,
dt97864.jpg,57881,0835be4a,\folder2\folderB\,
dt97865.jpg,57882,0835be4b,\folder2\folderB\,
dt97866.jpg,57883,0835be4c,\folder2\folderB\,
dt97867.jpg,57884,0835be4d,\folder3\folderA\,
dt97868.jpg,57885,0835be4e,\folder3\folderA\,

The results I require is:

file3.csv

dt97866.jpg,57883,0835be4c,\folder2\folderB\,
dt97867.jpg,57884,0835be4d,\folder3\folderA\,
dt97868.jpg,57885,0835be4e,\folder3\folderA\,

EDIT: With the help below I came to the following solution which I believe to be nice and elegant:

public static IEnumerable<string> ReadFile(string path)
        {
            string line;

            using (var reader = File.OpenText(path))
                while ((line = reader.ReadLine()) != null)
                    yield return line;
        }

then:

var file2 = ReadFile(file2FilePath);
var file1 = ReadFile(file1FilePath);
var file3 = file2.Except(file1);
File.WriteAllLines(file3FilePath, file3);
Steve W
  • 1,108
  • 3
  • 13
  • 35
  • 1
    You should show what you have tried please. – Bit May 29 '14 at 16:02
  • I usually would but on this occasion I am not sure what approach to take. I've described 1 method that I think would work but am hoping that somebody will suggest something much more efficient. – Steve W May 29 '14 at 16:04
  • Are both CSV files already in alphabetical order? (It appears that way.) If not, does the order of the output matter? 30,000 lines of length ~50 characters is only a couple of megabytes, so I'd recommend pulling both files in, sorting, and searching for duplicates using a binary search. – adv12 May 29 '14 at 16:11
  • I would use something like EPPlus, load both files into memory collections (if the files aren't that large). Then you could use some LINQ or something to parse out the lines that aren't in File 1. – krillgar May 29 '14 at 16:17
  • If both files are always sorted the same way you can use an [external merge](http://stackoverflow.com/questions/20802396/). – Dour High Arch May 29 '14 at 16:20

3 Answers3

1

While this may not be the best approach, it's the one I've used in the past. It's a bit of a dirty hack, but...

  1. Import both CSV files into a datatable (so you will have two datatables -I personally prefer closed xml if you plan to use an excel type format, otherwise just use a normal file read/write - My example uses regular read/write)
  2. Move data from datatable into a list (my example assumes comma separated values, one per line.)
  3. Find unique values between lists and merge
  4. Export the merged lists to a csv file

*[Edited steps after actually working on the code]

Per request from Bit, I've added an example using sample data from Some Random Website - This was written in VS2008 against .NET 3.5, but it should work on 3.5+. I copied us-500 into 2 versions, the original and modified 1 row to create a unique value to test. This project is targeting x86 platform. I've used a new windows form for testing

using System.Data;
using System.Data.OleDb;
using System.IO;
using System.Linq;
using System.Windows.Forms;

namespace TestSandbox
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            var file1 = new DataTable();
            var file2 = new DataTable();

            InitializeComponent();
            //Gets data from csv file, select allows for filtering
            using (var conn = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\;Extended Properties=""text;HDR=Yes;FMT=Delimited"";"))
            {
                conn.Open();
                using (var adapter = new OleDbDataAdapter(@"select * from [us-500.csv]", conn))
                {
                    adapter.Fill(file1);
                }
            }

            using (var conn = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\;Extended Properties=""text;HDR=Yes;FMT=Delimited"";"))
            {
                conn.Open();
                using (var adapter = new OleDbDataAdapter(@"select * from [us-500-2.csv]", conn))
                {
                    adapter.Fill(file2);
                }
            }
            //Moves datatable information to lists for comparison
            var file1List = (from DataRow row in file1.Rows select row.ItemArray.Select(field => field.ToString()).ToArray() into fields select string.Join(",", fields)).ToList();
            var file2List = (from DataRow row in file2.Rows select row.ItemArray.Select(field => field.ToString()).ToArray() into fields select string.Join(",", fields)).ToList();
            //Adds all data from file2 into file1 list, except for data that already exists in file1
            file1List.AddRange(file2List.Except(file1List));
            //Exports all results to c:\results.csv
            File.WriteAllLines(@"C:\Results.csv", file1List.ToArray());
        }
    }
}

*Note: After looking at the code, importing straight to a list looks like it would be more efficient, but I'll leave this as is for now since it's not overly complicated.

Nathan
  • 51
  • 3
1

Assume the line is perfectly identical, you can read both file into two IEnumerable<string> and extract with IEnumerable.Except<T>. This will produce the same result regardless of the ordering~

Example :

var file1 = new List<string>{
        @"dt97861.jpg,149954,c1714ee1,\folder1\folderA\,",
        @"dt97862.jpg,149955,c1714ee0,\folder1\folderA\,",
        @"dt97863.jpg,59368,cd23f223,\folder2\folderA\,",
        @"dt97864.jpg,57881,0835be4a,\folder2\folderB\,",
        @"dt97865.jpg,57882,0835be4b,\folder2\folderB\,",
    };

var file2 = new List<string>{
        @"dt97862.jpg,149955,c1714ee0,\folder1\folderA\,",
        @"dt97863.jpg,59368,cd23f223,\folder2\folderA\,",
        @"dt97864.jpg,57881,0835be4a,\folder2\folderB\,",
        @"dt97865.jpg,57882,0835be4b,\folder2\folderB\,",
        @"dt97866.jpg,57883,0835be4c,\folder2\folderB\,",
        @"dt97867.jpg,57884,0835be4d,\folder3\folderA\,",
        @"dt97868.jpg,57885,0835be4e,\folder3\folderA\,",
    };

file2.Except(file1).Dump();

Output :

dt97866.jpg,57883,0835be4c,\folder2\folderB\, 
dt97867.jpg,57884,0835be4d,\folder3\folderA\, 
dt97868.jpg,57885,0835be4e,\folder3\folderA\, 

Here is the function to load any file into IEnumerable<string>. Just dont forget to using System.IO;.

public static IEnumerable<string> ReadFile(string path)
{
    string line;

    using(var reader = File.OpenText(path))
        while((line = reader.ReadLine()) != null)
            yield return line;
}

To write the result to a file :

 //using System.IO; is required
File.WriteAllLines("file3.csv", file2.Except(file1))

Remarks : File.WriteAllLines will create or overwrite the file.

Xiaoy312
  • 14,292
  • 1
  • 32
  • 44
  • This looks good. Please could you show me how to load the csv files into the file1 and file2 var's given the location of the csv files are file1Path and file2Path. – Steve W May 29 '14 at 16:24
  • Thanks. So now I have: var myFullCsv = ReadFile(myFullCsvFilePath); var masterCsv = ReadFile(csvFilePath); //If file not in masterCsv, put it into extras.csv var extraFilesCsv = myFullCsv.Except(masterCsv); Should I use StreamWriter to write this out to the new csv or is there a better way once you have all the data as one var? – Steve W May 29 '14 at 16:51
  • Yes, using [`File.WriteAllLines`](http://msdn.microsoft.com/en-us/library/dd383693(v=vs.110).aspx) – Xiaoy312 May 29 '14 at 17:15
  • Got it working. Wrote out the result with StreamWriter and its super quick. Thanks. – Steve W May 29 '14 at 17:15
  • File.WriteAllLines was even better with a single line of code to create my csv file. Thanks for your help. This was just the sort of solution I was hoping for but was concerned some may feel that I had not made enough effort without posting my own attempt at some code for this. – Steve W May 29 '14 at 18:03
0

Step 1. Using System.IO, we'll read two files using FileStream and create a third file using StreamWriter.

Step 2. Use FileStream to read file #1. e.g.

using (var FS = new System.IO.FileStream(file1, System.IO.FileMode.Open, System.IO.FileAccess.Read)) { ...<insert next steps in here>...}

Step 3. Nest another FileStream to read file #2. This stream will be read multiple times, so it's best if you can put the smaller file in this part of the nest. You can do this by checking the size of the file prior to jumping into these loops.

Step 4. Read in a single line from our biggest file, File#1, then we compare it against ALL lines from File#2 sequentially. If a match is found, set a boolean to TRUE indicating that there is a matching line found in File #2.

Step 5. Once we're at the end of File #2, check for a true/false condition of the boolean. If its false, SAVE the string we read from File #1 into File #3. This is your output file.

Step 6. Reset the stream pointer for File #2 to the beginning of the file e.g. FS.Seek(0, System.IO.SeekOrigin.Begin)

Step 7. Repeat from Step 4 until we've reached the end of File #1. File #3's contents should represent only unique entries from File #1 that are not members of File #2

Redgum
  • 388
  • 5
  • 14