
So I've decided to create a program that does quite a few things. One section of this program, called "text tools", takes a text file (via one button) and has additional buttons that perform other functions, like removing whitespace and empty lines from the file, removing duplicates, and removing lines that match a certain pattern, e.g. 123 or abc.

I'm able to import the file and print the list using a foreach loop, and I believe I'm along the right lines; however, I need to remove duplicates. I've decided to use a HashSet, thanks to a thread which says it's the simplest and fastest method (my file will contain millions of lines).

The problem is that I can't figure out just what I'm doing wrong. I've got the event handler for the button click, created a list of strings in memory, looped through each line in the file (adding it to the list), then created another variable and set it to be a HashSet of the list. (Sorry if that's convoluted; it doesn't work, for a reason I can't see.)

I've looked at every Stack Overflow question similar to this, but I can't find any solution. I've also looked into HashSet in general, to no avail.

Here's my code so far:

    private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
    {
        List<string> list = new List<string>();

        foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
        {
            list.Add(line);
        }

        var DuplicatesRemoved = new HashSet<String>(list);
    }
Cpt Price
  • https://stackoverflow.com/questions/31052953/how-to-convert-listt-to-hashsett-in-c – Mitch Wheat Nov 15 '18 at 02:13
  • https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1.-ctor?view=netframework-4.7.2#System_Collections_Generic_HashSet_1__ctor_System_Collections_Generic_IEnumerable__0__ – mjwills Nov 15 '18 at 02:19
  • cannot convert from 'System.Collections.Generic.List&lt;string&gt;' to 'System.Collections.Generic.IEqualityComparer&lt;string&gt;' – Cpt Price Nov 15 '18 at 02:20
  • 2
    `Respectfully I didn't open the question to ask for links that I've already found` If you are going to be snarky, at least provide the links that you **have** read. We aren't mind readers. :) – mjwills Nov 15 '18 at 02:24
  • `var g = new HashSet<string>(list);` is likely what you want. – mjwills Nov 15 '18 at 02:24
  • ok well I've read the above two. I said in the OP that I've googled it a little bit so the first few pages on there (and on stackoverflow) I've seen. I can't list every website because I haven't been keeping track. – Cpt Price Nov 15 '18 at 02:26
  • I did that, it solved the problem; however it's not getting any of the list. I've set a breakpoint after the file has been looped through and all lines are added to the list variable, but it's not transferring over into the hashset – Cpt Price Nov 15 '18 at 02:27
  • So, you want to load "million of lines" into a collection and then make a copy of it? – Ňɏssa Pøngjǣrdenlarp Nov 15 '18 at 02:29
  • The file has millions of lines. Through whatever method is fastest I want to delete duplicates from that file and replace that file; making a collection and writing that collection seems like the simplest method to me – Cpt Price Nov 15 '18 at 02:33
  • 3
    I'd suggest stopping using the `List` altogether and use a `HashSet` then. You don't need the `List`. Note that `HashSet` could, in theory, return data in a different order than in the file (it won't with the current implementation, but it could in future). – mjwills Nov 15 '18 at 02:36
  • that bit does work now. I used the solution below which allowed me to remove the use of list (like you said) and looping through them. I have another problem in removing the whitespaces from the text file in the same way but should I update my thread or create a new post? – Cpt Price Nov 15 '18 at 03:21
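The suggestion in the comments above (drop the `List` and add lines straight into a `HashSet`) might look like the following sketch. The handler and `FilePath` names are taken from the question; writing the set back to `FilePath` is an assumption about what the button should ultimately do:

```csharp
private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
{
    // No intermediate List<string>: each line goes straight into the set,
    // which silently ignores duplicates.
    var uniqueLines = new HashSet<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        uniqueLines.Add(line);
    }

    // Assumed final step: overwrite the original file with the unique lines.
    File.WriteAllLines(FilePath, uniqueLines);
}
```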

2 Answers


To be specific to your question, and to get my last 3 points:

var lines = File.ReadAllLines("somepath");
var hashSet = new HashSet<string>(lines);
File.WriteAllLines("somepath", hashSet.ToList());

Note there are other ways, and maybe more performant ways, of doing this. It depends on the number of duplicates and the size of the file.
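One such alternative (a sketch, with placeholder file paths) is LINQ's `Distinct()`, which streams the lines lazily and keeps the first occurrence of each, so the whole file never has to be materialized as a list first:

```csharp
using System.IO;
using System.Linq;

// "input.txt" / "output.txt" are hypothetical paths; substitute your own.
// ReadLines enumerates the file lazily; Distinct() filters duplicates
// while preserving the original line order.
File.WriteAllLines("output.txt",
    File.ReadLines("input.txt").Distinct());
```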

TheGeneral
  • 2 things: 1) Would this write the file to the same path it took it from? (just to clarify) 2) I used ReadLines above because people said it was faster; would there be any impact on performance between the two methods on a file that has millions of lines? – Cpt Price Nov 15 '18 at 02:32
  • 1
    @CollegeAmeteur millions of lines is a completely different optimization, and there maybe several things involved to make this more efficient than `ReadAllLines` and `ReadLines`. what i suggest you do, download a benchmark tool and see what works for you. – TheGeneral Nov 15 '18 at 02:36

It is preferable to process the file as a stream if possible. I would not even call it optimization; I would rather call it not being wasteful. If you can use the stream approach, the ReadAllLines approach is somewhere between almost good and very bad, depending on the situation. It is also a good idea to preserve line order. HashSet generally does not preserve order; if you store everything into it and read it back, it can be shuffled.

using (var outFile = new StreamWriter(outFilePath))
{
    HashSet<string> seen = new HashSet<string>();
    foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
    {
        if (seen.Add(line))
        {
            outFile.WriteLine(line);
        }
    }
}
Antonín Lejsek