0

I have a CSV file containing 15,00,000 records and need to find the duplicate rows in it. I am trying the code below:

DataTable dtUniqueDataView = dvDataView.ToTable(true, Utility.GetHeadersFromCsv(csvfilePath).Select(c => c.Trim()).ToArray());

But this does not give me the duplicate records, and the operation takes nearly 4 minutes. Can anyone suggest a process that would reduce the time and return the duplicate result set?
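
For reference, ToTable(true, ...) returns the distinct rows, so the duplicates are exactly the rows it drops. One way to surface the rows that occur more than once, assuming the data is already in a DataTable, is to group on all column values. The sketch below is only an illustration: dtData, the key separator and the "all columns equal" definition of a duplicate are assumptions, and ItemArray handling makes this slow for 1.5 million rows.

// Hedged sketch: group the rows of an in-memory DataTable by all of their column values
// and keep the groups with more than one member. dtData and the key separator are assumptions.
using System;
using System.Data;
using System.Linq;

static class DuplicateRowsSketch
{
    public static void Print(DataTable dtData)
    {
        var duplicateGroups = dtData.AsEnumerable()
            .GroupBy(row => string.Join("\u0001", row.ItemArray)) // one key per row, all columns joined
            .Where(g => g.Count() > 1);                           // keys seen more than once = duplicates

        foreach (var g in duplicateGroups)
            Console.WriteLine(g.Count() + " occurrences of: " + g.Key.Replace("\u0001", ";"));
    }
}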

Hemant Kumar
  • 4,593
  • 9
  • 56
  • 95
  • Can you hold the entire file in memory? Also, if the *reading* of the file takes 4 minutes, and this is too long to find the duplicates, the duplicate finding code isn't your problem, you need to find a better/faster way to read the file. – Lasse V. Karlsen Dec 19 '14 at 07:16
  • Yes, you are correct. I have tried all the ways, but no luck so far. – Hemant Kumar Dec 19 '14 at 07:21
  • Try a BULK INSERT into a SQL table, then run a SELECT statement to find the duplicates. – highwingers Dec 19 '14 at 07:25
  • @highwingers You are correct, but this application makes no DB calls. – Hemant Kumar Dec 19 '14 at 07:29
  • Define "fastest" on what machine? How big is each row? – Aron Dec 19 '14 at 07:32
  • @Aron There are 6 columns in the csv file. – Hemant Kumar Dec 19 '14 at 07:33
  • @HemantKumar 6 what columns? ASCII fixed width? UTF-8? How much memory? How big is the file? What disk type is it? Is fast enough good enough? – Aron Dec 19 '14 at 07:37
  • Do you interpret the data in any way or will the duplicate rows manifest themselves as duplicate text lines in the file as well? – Lasse V. Karlsen Dec 19 '14 at 07:39
  • Use an available, efficient CSV parser to parse each line, like >>[this](http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader)<<. Then create a class with meaningful properties for each column. The important part: override `Equals` + `GetHashCode` in the desired way (>>[How](http://stackoverflow.com/a/263416/284240)<<). Then you can use this class in a `HashSet`, which only allows unique elements. `HashSet.Add` returns a `bool`; if it is `false`, you know that you have found a duplicate. Then you have an efficient, readable, maintainable and robust solution. – Tim Schmelter Dec 19 '14 at 08:12 (a hedged sketch of this approach follows these comments)
  • Do you have 1.500.000 entries or 15.000.000? – Krumelur Dec 19 '14 at 10:11
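
Following Tim Schmelter's comment above, a minimal sketch of that approach might look like the code below. It is only an illustration: the CsvRecord class, its placeholder field names, the semicolon separator, the six-column layout and the data.csv path are all assumptions, and line.Split stands in for a real CSV parser.

// Hedged sketch: parse each line into a small record type, override Equals/GetHashCode,
// and let HashSet<T>.Add report duplicates. Field names, separator and path are assumptions.
using System;
using System.Collections.Generic;
using System.IO;

class CsvRecord
{
    public string A, B, C, D, E, F; // placeholder names for the six columns

    public override bool Equals(object obj)
    {
        var other = obj as CsvRecord;
        if (other == null) return false;
        return A == other.A && B == other.B && C == other.C
            && D == other.D && E == other.E && F == other.F;
    }

    public override int GetHashCode()
    {
        unchecked // standard hash-combining pattern
        {
            int hash = 17;
            hash = hash * 23 + (A ?? "").GetHashCode();
            hash = hash * 23 + (B ?? "").GetHashCode();
            hash = hash * 23 + (C ?? "").GetHashCode();
            hash = hash * 23 + (D ?? "").GetHashCode();
            hash = hash * 23 + (E ?? "").GetHashCode();
            hash = hash * 23 + (F ?? "").GetHashCode();
            return hash;
        }
    }
}

class FindDuplicatesSketch
{
    static void Main()
    {
        var seen = new HashSet<CsvRecord>();
        foreach (string line in File.ReadLines("data.csv"))
        {
            string[] parts = line.Split(';'); // a real CSV parser also handles quoting etc.
            var rec = new CsvRecord { A = parts[0], B = parts[1], C = parts[2],
                                      D = parts[3], E = parts[4], F = parts[5] };
            if (!seen.Add(rec)) // Add returns false when an equal record is already present
                Console.WriteLine("Duplicate: " + line);
        }
    }
}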

3 Answers

0

Not the final solution but maybe something to start with:

Read the CSV file line by line and calculate a hash value of each line. You should be able to keep those numeric values in memory.

String.GetHashCode() is not good enough for this purpose, as it may return the same result for different strings, as correctly pointed out in the comments. A more robust hashing algorithm is required.

Store the hash values in a HashSet<int> and check whether each new value already exists there. If it does, you can skip the row.

Note: If most of the time is spent reading the file (as assumed in the comment above), you will have to work on this issue first. My assumption is you're worried about finding the duplicates.
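
A minimal sketch of this idea, combined with the exact-comparison fallback suggested in the comments below, might look like the following. The two-pass structure, the FNV-1a hash and the data.csv path are assumptions added for illustration, not part of the original answer.

// Hedged sketch: pass 1 keeps only a 64-bit hash per line; pass 2 re-reads the file and does
// an exact string comparison for the few lines whose hash occurred more than once, so a rare
// hash collision cannot produce a false duplicate. Hash choice and file path are assumptions.
using System;
using System.Collections.Generic;
using System.IO;

class HashPassSketch
{
    static ulong Fnv1a64(string s)
    {
        ulong hash = 14695981039346656037UL; // FNV-1a offset basis
        foreach (char c in s)
        {
            hash ^= c;
            hash *= 1099511628211UL;         // FNV-1a prime
        }
        return hash;
    }

    static void Main()
    {
        string path = "data.csv";

        // Pass 1: count how often each line hash occurs (roughly 8 bytes per distinct line plus overhead).
        var hashCounts = new Dictionary<ulong, int>();
        foreach (string line in File.ReadLines(path))
        {
            ulong h = Fnv1a64(line);
            int count;
            hashCounts.TryGetValue(h, out count);
            hashCounts[h] = count + 1;
        }

        // Pass 2: only lines with a repeated hash are candidates; confirm with exact equality.
        var candidates = new HashSet<string>();
        foreach (string line in File.ReadLines(path))
        {
            if (hashCounts[Fnv1a64(line)] > 1 && !candidates.Add(line))
                Console.WriteLine("Duplicate: " + line);
        }
    }
}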

Krumelur
  • 32,180
  • 27
  • 124
  • 263
  • Curious about the "-1" - any comments? – Krumelur Dec 19 '14 at 07:18
  • 1
    Birthday Paradox. NEVER EVER test for identity with a hash code exclusively. Your algorithm is almost guaranteed to have a hash collision for the dataset size the OP gave. – Aron Dec 19 '14 at 07:19
  • If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. **Different strings can return the same hash code**. [MSDN](http://msdn.microsoft.com/es-es/library/system.string.gethashcode%28v=vs.110%29.aspx) – Dubas Dec 19 '14 at 07:19
  • The concept is good, but a different hash algorithm is required to prevent incorrect duplicate detections. Reading the file with a buffer, rather than only line by line, is also required to improve performance. – Dubas Dec 19 '14 at 07:22
  • @Dubas we call that one the pigeonhole principle. – Aron Dec 19 '14 at 07:23
  • @Krumelur I am using TextFieldParser to read the csv file and loading into DataTable. – Hemant Kumar Dec 19 '14 at 07:25
  • 1
    @Dubas You're right! Revised my answer - maybe it can still serve as a starter for a different implementation. – Krumelur Dec 19 '14 at 07:26
  • @Krumelur That solution is simply to use a `HashSet` to begin with. .NET does all that hash-code work when you add to it... Just for your information, your current algorithm has less than a 1 in a googol chance of working correctly. https://www.google.com.hk/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=exp(-1500000%5E2+%2F++2%5E33) – Aron Dec 19 '14 at 07:29
  • @Aron With `HashSet` there will be all content of the file in memory (a lot of strings). If that is not a problem, it's a good solution. Maybe a combination would work? Store hash and a line index and if there's a hash collision, check for string equality. – Krumelur Dec 19 '14 at 07:33
  • I am aware of the limitations of memory...but the OP never noted the bounds of the problem for optimisation. We don't know that he doesn't have enough ram. Besides, depending on his disk, you could have expensive random access on the file. – Aron Dec 19 '14 at 07:35
0

Read the CSV file as a stream, one line at a time. For each line read, calculate the MD5 hash and check whether that hash already exists in your stash. If it does, the line is a duplicate.
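
A hedged sketch of this approach follows (the data.csv path and the Base64 encoding of the digest are assumptions). As the comments below note, MD5 is not collision-free, so a repeated hash is only almost certainly a duplicate.

// Hedged sketch: stream the file one line at a time, hash each line with MD5, and treat a
// repeated digest as a duplicate. Path and digest encoding are assumptions.
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class Md5LineSketch
{
    static void Main()
    {
        var seenHashes = new HashSet<string>();
        using (var md5 = MD5.Create())
        using (var reader = new StreamReader("data.csv"))
        {
            string line;
            int lineNo = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNo++;
                byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(line));
                string key = Convert.ToBase64String(digest); // 16-byte digest as a compact string key
                if (!seenHashes.Add(key))                    // Add returns false if this digest was seen before
                    Console.WriteLine("Line " + lineNo + " is a duplicate: " + line);
            }
        }
    }
}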

Mark Menchavez
  • 1,651
  • 12
  • 15
  • This answer is worse than @Krumelur's: it misunderstands what a hash is and applies a cryptographic hash where a simple hash would do. The one good thing here is that the hash is longer. – Aron Dec 19 '14 at 08:00
  • @Aron, While the MD5 algorithm is not collision-free and does not guarantee uniqueness, for practical purposes this will serve Hemant's need. It is unique enough. This is why MD5 and other crypto hashes can be used as message digests. – Mark Menchavez Dec 19 '14 at 08:19
  • I would suggest a non-cryptohash instead, such as [SpookyHash](http://www.nuget.org/packages/System.Data.HashFunction.SpookyHash/). – Aron Dec 22 '14 at 01:19
0

I wrote an example with a HashSet:

Output (15,000,000 entries in a csv file):
Reading File
File distinct read in 1600,6632 ms

Output (30,000,000 entries in a csv file):
Reading File
File distinct read in 3192,1997 ms

Output (45,000,000 entries in a csv file):
Reading File
File distinct read in 4906,0755 ms

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        string csvFile = "test.csv";
        if (!File.Exists(csvFile)) // Create a test CSV file
            CreateCSVFile(csvFile, 45000000, 15000);
        List<string> distinct = GetDistinct(csvFile); // Returns every line once

        Console.ReadKey();
    }

    static List<string> GetDistinct(string filename)
    {
        Stopwatch sw = new Stopwatch(); // just a timer
        List<HashSet<string>> lines = new List<HashSet<string>>(); // a HashSet is very fast at finding duplicates
        HashSet<string> current = new HashSet<string>(); // the hashset currently being filled
        lines.Add(current); // add the current hashset to the list of hashsets
        sw.Restart(); // just a timer
        Console.WriteLine("Reading File"); // just an output message
        foreach (string line in File.ReadLines(filename))
        {
            try
            {
                if (lines.TrueForAll(hSet => !hSet.Contains(line))) // look for an existing entry in any of the hashsets
                    current.Add(line); // if the line was not found, add it to the current hashset
            }
            catch (OutOfMemoryException) // a single HashSet throws at roughly 12,000,000 elements
            {
                current = new HashSet<string>() { line }; // the line could not be added above, so start a new hashset with it
                lines.Add(current); // add the new hashset to the list of hashsets
            }
        }
        sw.Stop(); // just a timer
        Console.WriteLine("File distinct read in " + sw.Elapsed.TotalMilliseconds + " ms"); // just an output message
        List<string> concatenated = new List<string>(); // create a list of strings out of the hashset list
        lines.ForEach(set => concatenated.AddRange(set)); // fill the list of strings
        return concatenated; // return the list
    }

    static void CreateCSVFile(string filename, int entries, int duplicateRow)
    {
        using (FileStream fs = File.OpenWrite(filename))
        using (StreamWriter sw = new StreamWriter(fs))
        {
            Random r = new Random();
            string duplicateLine = null;
            string line = "";
            for (int i = 0; i < entries; i++)
            {
                line = r.Next(1, 10) + ";" + r.Next(11, 45) + ";" + r.Next(20, 500) + ";" + r.Next(2, 11) + ";" + r.Next(12, 46) + ";" + r.Next(21, 501);
                sw.WriteLine(line);
                if (i % duplicateRow == 0)
                {
                    if (duplicateLine != null && i < entries - 1)
                    {
                        sw.WriteLine(duplicateLine); // write the remembered line again so the file really contains duplicates
                        i++;
                    }
                    duplicateLine = line;
                }
            }
        }
    }
}
Mitja
  • 863
  • 5
  • 22
  • What I find really interesting about this solution is to catch the OOM of the HashSet and create a new one. One might think OOM really means "no more memory" but in this case it just means the HashSet cannot hold any more data. Creative :-) – Krumelur Dec 19 '14 at 10:10