So I have 5 text files that are 50GB each, and I'd like to combine all of them into one text file and then apply LINQ's .Distinct() so that the new file contains only unique entries.

The way I'm doing it now is like so:

foreach (var file in files)
{
    if (Path.GetExtension(file) == ".txt")
    {
        var lines = File.ReadAllLines(file);  // loads the whole 50GB file into memory
        var b = lines.Distinct();             // only de-duplicates within this one file
        File.AppendAllLines(clear, b);        // 'clear' is the output file path
    }
}

The issue that occurs here is that the application loads the entire text file into memory, pushing my RAM usage up to 100%. This solution might have worked if I had 64GB of RAM, but I only have 16GB. What's the best option for me to achieve what I'm trying to accomplish? Should I utilize the cores on my CPU? I'm running a 5900X.

JohnA
  • A line should be distinct within a file or all files? – emoacht Mar 20 '22 at 03:10
  • I'm trying to make it to where the final file only has distinct entries – JohnA Mar 20 '22 at 03:14
  • To save memory, you should [read the file as stream](https://learn.microsoft.com/en-us/dotnet/standard/io/how-to-read-text-from-a-file) and [append to another stream](https://learn.microsoft.com/en-us/dotnet/standard/io/how-to-open-and-append-to-a-log-file) – Ibram Reda Mar 20 '22 at 04:42 (a minimal sketch of this appears after the comments)
  • Read the input file line by line and also write to the output line by line. The tricky part is the Distinct. You simply cannot remember all seen lines in memory. To find a solution here: how long are the lines on average? How many (percentage) duplicates exist? – Klaus Gütter Mar 20 '22 at 05:36
  • @KlausGütter I'd say that the lines on average is about 15 characters long – JohnA Mar 20 '22 at 06:33
  • Here are some hints on how to find duplicates in such a large data set: https://stackoverflow.com/questions/16598062/find-unique-values-from-a-large-file https://stackoverflow.com/questions/10021927/print-unique-lines-of-a-10gb-file – Klaus Gütter Mar 20 '22 at 06:54
  • Is maintaining order important? And how many potential characters are there for the first character? – Mark Cilia Vincenti Mar 20 '22 at 18:40
  • Assuming one line is composed of 15 ASCII characters plus CrLf, its size will be 17 bytes. 50GiB is 53,687,091,200 bytes, so there will be about 15,790,320,941 lines across the 5 files. Obviously they cannot all be processed in memory at once; you will need to start by dividing them into smaller groups. – emoacht Mar 21 '22 at 00:25
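
Putting those comments together (read each input lazily, write to the output as you go), here is a minimal sketch. It assumes the same `files` collection as in the question and a placeholder output path, and it only merges, with no deduplication yet:

    using System.IO;

    // "combined.txt" is a placeholder output path.
    using (var writer = new StreamWriter("combined.txt", append: true))
    {
        foreach (var file in files)
        {
            if (Path.GetExtension(file) != ".txt")
                continue;

            // File.ReadLines enumerates lazily, one line at a time,
            // so memory use stays flat regardless of file size.
            foreach (var line in File.ReadLines(file))
            {
                writer.WriteLine(line);
            }
        }
    }

This keeps RAM flat, but as Klaus Gütter points out, the hard part is running Distinct across roughly 15 billion lines, which the answer below tackles.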

1 Answer

If maintaining order is not important, and if the potential characters are limited (e.g. A-Z), a possibility would be to say, "OK, let's start with the As."

So you start with each file and go through it line by line until you find a line starting with 'A'. When you find one, add it to a new file and to a HashSet. Each time you find another line starting with 'A', check whether it is already in the HashSet; if not, add it to both the new file and the HashSet. Once you've processed all the files, discard the HashSet and skip to the next letter (B).

You're going to iterate through the files 26 times this way.
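
Below is a minimal sketch of those passes, assuming every line starts with an uppercase letter A-Z and reusing the `files` collection from the question; `outputPath` is a placeholder:

    using System.Collections.Generic;
    using System.IO;

    using (var writer = new StreamWriter(outputPath))
    {
        for (char letter = 'A'; letter <= 'Z'; letter++)
        {
            // A fresh HashSet per pass holds only the lines starting
            // with the current letter, roughly 1/26th of the data.
            var seen = new HashSet<string>();

            foreach (var file in files)
            {
                foreach (var line in File.ReadLines(file))
                {
                    // HashSet<T>.Add returns false for duplicates, so
                    // each distinct line is written exactly once.
                    if (line.Length > 0 && line[0] == letter && seen.Add(line))
                        writer.WriteLine(line);
                }
            }
            // 'seen' becomes unreachable here, so the GC can reclaim it
            // before the next letter's pass.
        }
    }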

Of course you can optimise it even further. Check how much memory is available and divide the possible characters into ranges, so that, for example, on the first pass your HashSet might hold anything starting with A-D.
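
For instance, the passes could be driven by a table of character ranges instead of single letters. The boundaries below are made up; in practice you would size them against the memory you measured:

    using System.Collections.Generic;
    using System.IO;

    // Hypothetical ranges: wider ranges mean fewer passes over the
    // 250GB of input, but a bigger HashSet per pass.
    var ranges = new (char From, char To)[] { ('A', 'D'), ('E', 'M'), ('N', 'Z') };

    using (var writer = new StreamWriter(outputPath))
    {
        foreach (var (from, to) in ranges)
        {
            var seen = new HashSet<string>();

            foreach (var file in files)
            {
                foreach (var line in File.ReadLines(file))
                {
                    // A line belongs to the current pass when its first
                    // character falls inside [from, to].
                    if (line.Length > 0 && line[0] >= from && line[0] <= to && seen.Add(line))
                        writer.WriteLine(line);
                }
            }
        }
    }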

Mark Cilia Vincenti