1

I want to edit a text file so that each line exists only once in it. Each line contains exactly 10 characters. I am generally working with 5-6 million lines, so the code I am currently using consumes too much RAM.

My code:

File.WriteAllLines(targetpath, File.ReadAllLines(sourcepath).Distinct());

So how can I make it consume less RAM and take less time at the same time?

Ali Tor
  • I guess one way is to make a small hash of every line and find the duplicates that way (probably involves some sorting). Then use that result to remove the duplicates from the original file. Hashing can be tricky though and I'm no expert. – bronlund Mar 25 '17 at 13:23
  • @bronlund how are you going to hash 10 characters into a hash that utilizes less space and doesn't cause collisions? – CodeCaster Mar 25 '17 at 13:28
  • Are the duplicate lines following each other, or can the last line in the file be a duplicate of the first line? What are the exact properties of the ten-character lines? Must the order of the input file be maintained in the output? One approach might be to chunk the input into separate lists first, by starting character(s), then searching through those lists when processing the next line. – CodeCaster Mar 25 '17 at 13:29
  • @CodeCaster You're probably right, that could be tricky. Another way could be to split the file up in chunks and compare the file to one chunk at a time maybe - yeah, right, as he already said %] – bronlund Mar 25 '17 at 13:34
  • Another thing that might be helpful is Bloom filters, but those can give false positives so would need a second round of checks anyway. – bronlund Mar 25 '17 at 13:38
  • I am thinking about using a stream to read the source file line by line, and creating a temp file named after the content of each line. In the end, write all the temp file names into the target file and delete all the temp files. – Liu Mar 25 '17 at 13:43
  • Anyway this code should run just fine. What do you mean by _"currently is consuming too much RAM"_? What is the actual problem? The only way for this code to consume less RAM is to "bucketize" the data into files and only load the relevant files one at a time, which will degrade performance (see the sketch after these comments). – CodeCaster Mar 25 '17 at 13:43
  • @Jianping you don't want to create 5-6 million files for this. – CodeCaster Mar 25 '17 at 13:44
  • @CodeCaster actually I am trying to index all lines, and I think using disk space is better than using RAM – Liu Mar 25 '17 at 13:45
  • @CodeCaster and in the end, you will delete all temp files. – Liu Mar 25 '17 at 13:46
  • @jianping again, you do not want to create 5-6 million files. The data we're talking about here is 50-60 MB, which should fit perfectly in RAM. – CodeCaster Mar 25 '17 at 13:47
  • @JianpingLiu creative but wrecks your filesystem and probably your hard drive as well. – Steffen Winkler Mar 25 '17 at 14:01
  • Are the characters in a fixed range? It might be possible to convert them into a numeric datatype to reduce memory overheads. – spender Mar 25 '17 at 14:01
  • I have to say that I do not understand the need to worry about RAM here; 100 MB can be handled even on a Raspberry Pi, so it is more an issue of not wasting RAM with the algorithm... – ipavlu Mar 25 '17 at 14:05
  • Also consider that writing to disk is muuuch slower than writing to RAM. – vyrp Mar 25 '17 at 14:16
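
For reference, here is a rough sketch of the "bucketize into files" idea raised in the comments above. Everything in it is illustrative: the file names source.txt and target.txt and the bucket_*.tmp naming pattern are placeholders, and it assumes every line is non-empty (as in the question, where each line has 10 characters). It splits the input into temporary bucket files by first character, dedupes each bucket in memory one at a time, then concatenates the results; this trades speed for a smaller peak working set and does not preserve the original line order across buckets.

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Pass 1: route each line into a bucket file keyed by its first character.
    var writers = new Dictionary<char, StreamWriter>();
    foreach (string line in File.ReadLines("source.txt"))
    {
        char bucket = line[0];
        if (!writers.TryGetValue(bucket, out StreamWriter w))
            writers[bucket] = w = new StreamWriter($"bucket_{(int)bucket}.tmp");
        w.WriteLine(line);
    }
    foreach (var w in writers.Values) w.Dispose();

    // Pass 2: dedupe one bucket at a time, so only one bucket is held in memory.
    using (var output = new StreamWriter("target.txt"))
    {
        foreach (char bucket in writers.Keys)
        {
            string tmp = $"bucket_{(int)bucket}.tmp";
            foreach (string line in File.ReadAllLines(tmp).Distinct())
                output.WriteLine(line);
            File.Delete(tmp);
        }
    }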

3 Answers

3

Taking into account how much memory a string takes in C#, and assuming a length of 10 characters for each of the 6 million records, we get:

  • size in bytes ~= 20 + (length / 2) * 4
  • total size in bytes ~= (20 + (10 / 2) * 4) * 6,000,000 = 240,000,000
  • total size in MB ~= 230

Now, 230 MB of space is not really a problem, even on x86 (a 32-bit system), so you can load all that data into memory. For this I would use the HashSet class, which is, obviously, a hash set that lets you easily eliminate duplicates by doing a lookup before adding an element.
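
As an illustration, a minimal sketch of that approach (not the answer's exact code; the paths are placeholders):

    using System.Collections.Generic;
    using System.IO;

    string sourcepath = "source.txt";   // placeholder paths
    string targetpath = "target.txt";

    // HashSet<string>.Add returns false when the element is already present,
    // so duplicates are skipped and the first-occurrence order is preserved.
    var seen = new HashSet<string>();
    var unique = new List<string>();

    foreach (string line in File.ReadAllLines(sourcepath))
    {
        if (seen.Add(line))        // false => this line is a duplicate
            unique.Add(line);
    }

    File.WriteAllLines(targetpath, unique);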

In terms of big-O notation for time complexity, the average performance of a lookup in a hash set is O(1), which is the best you can get. In total, you would perform the lookup N times, totalling N * O(1) = O(N).

In terms of big-O notation for space complexity, you would have O(N) space used, meaning that you use memory proportional to the number of elements, which is also the best you can get.

I'm not sure it is even possible to use less space if you implement the algorithm in C# and do not rely on any external components (those would also use at least O(N)).

That being said, you can optimize for some scenarios by reading your file sequentially, line by line, see here. This would give a better result if you have lots of duplicates, but the worst-case scenario, when all the lines are distinct, would consume the same amount of memory.
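
A minimal sketch of that sequential variant (file names are placeholders): File.ReadLines streams the file lazily and Distinct() only keeps the set of lines seen so far, so the whole input array is never materialized. Note the target must be a different file from the source, because writing starts while the source is still being read.

    using System.IO;
    using System.Linq;

    // Streams the source line by line; memory holds only the distinct lines seen so far.
    File.WriteAllLines("target.txt", File.ReadLines("source.txt").Distinct());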

On a final note, if you look at how the Distinct method is implemented, you will see it also uses an implementation of a hash table; it's not the same class, but the performance is still roughly the same. Check out this question for more details.

ironstone13
  • There is but one important issue: the int32 hash used by the HashSet. There will be hash conflicts, and hash conflicts are not necessarily line duplicates. Distinct tests the items themselves when a hash conflict happens. – ipavlu Mar 25 '17 at 16:02
  • @ipavlu, yes there will be conflicts, however the string hashing implementation has a very good uniform distribution across all int32 values, and there are over 4 billion of them while we have only 6 million values. Also, HashSet of string only uses the int32 key for hash bucket addressing, and after that it uses chaining and comparison for string equality. So no string instances are "dropped" because of the same hash. That is how most hash tables work. LINQ uses a Set class that is just a copy-paste of the HashSet class. See the implementation at https://referencesource.microsoft.com/ – ironstone13 Mar 25 '17 at 16:22
  • You are correct, I was still thinking about keeping just the hash, no data, but HashSet does both... – ipavlu Mar 25 '17 at 17:10
  • Btw, I made tests and created a 6-million-line file from random numbers. It is easy to get a lot of conflicts, especially when we do not know anything about the data. About 5.8 million were in the first bucket (a dictionary in my case), a little under 200k in the second and a few in the third. – ipavlu Mar 25 '17 at 17:15
0

As ironstone13 corrected me, HashSet is OK, but it does store the data. So this works fine too:

        string[] arr = File.ReadAllLines("file.txt");
        HashSet<string> hashes = new HashSet<string>();

        for (int i = 0; i < arr.Length; i++)
        {
            // Add returns false when the line is already in the set,
            // so duplicates are blanked out and filtered before writing.
            if (!hashes.Add(arr[i])) arr[i] = null;
        }

        File.WriteAllLines("file2.txt", arr.Where(x => x != null));

This implementation was motivated by memory performance and hash conflicts. The main idea was to keep just the hashes; of course, it would then have to go back to the file to fetch the line it sees as a hash conflict/duplicate, to detect which one it actually is (that part is not implemented).

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Program
{
    static string[] arr;
    static Dictionary<int, int>[] hashes = new Dictionary<int, int>[]
        { new Dictionary<int, int>() };
    static int[] file_indexes = {-1};


    // Records the hash in the first "layer" dictionary that does not contain it yet;
    // if every layer already has this hash, a new layer is appended for it.
    static void AddHash(int hash, int index)
    {
        for (int h = 0; h < hashes.Length; h++)
        {
            Dictionary<int, int> dict = hashes[h];
            if (!dict.ContainsKey(hash))
            {
                dict[hash] = index;
                return;
            }
        }
        hashes = hashes.Union(new[] {new Dictionary<int, int>() {{hash, index}}}).ToArray();
        file_indexes = Enumerable.Range(0, hashes.Length).Select(x => -1).ToArray();
    }

    // For each layer, remembers which line index (if any) already uses this hash;
    // returns how many layers contain the hash (0 means the hash is new).
    static int UpdateFileIndexes(int hash)
    {
        int updates = 0;
        for (int h = 0; h < hashes.Length; h++)
        {
            int index;
            if (hashes[h].TryGetValue(hash, out index))
            {
                file_indexes[h] = index;
                updates++;
            }
            else
            {
                file_indexes[h] = -1;
            }
        }
        return updates;
    }

    // Compares the current line against every remembered line with the same hash;
    // returns true only for a real duplicate, not for a mere hash collision.
    static bool IsDuplicate(int index)
    {
        string str1 = arr[index];
        for (int h = 0; h < hashes.Length; h++)
        {
            int i = file_indexes[h];
            if (i == -1 || index == i) continue;
            string str0 = arr[i];
            if (str0 == null) continue;
            if (string.CompareOrdinal(str0, str1) == 0) return true;
        }
        return false;
    }


    static void Main(string[] args)
    {
        arr = File.ReadAllLines("file.txt");

        for (int i = 0; i < arr.Length; i++)
        {
            int hash = arr[i].GetHashCode();

            if (UpdateFileIndexes(hash) == 0) AddHash(hash, i);
            else if (IsDuplicate(i)) arr[i] = null;
            else AddHash(hash, i);
        }

        File.WriteAllLines("file2.txt", arr.Where(x => x != null));



        Console.WriteLine("DONE");
        Console.ReadKey();
    }
}
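
For completeness, a hedged sketch of the missing "go back to the file" step described above: with the fixed-length lines from the question, a line can be re-read by seeking to its record offset instead of keeping the whole file in memory. The class name is made up, and it assumes ASCII-encoded 10-character lines terminated by CRLF, i.e. a 12-byte record; adjust RecordLength for other encodings or line endings.

    using System.IO;
    using System.Text;

    static class FixedLengthLineReader
    {
        private const int LineLength = 10;                // characters per line (from the question)
        private const int RecordLength = LineLength + 2;  // + "\r\n" terminator (assumed)

        // Reads the line at the given zero-based index by seeking directly to its record.
        public static string ReadLineAt(string path, long lineIndex)
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                fs.Seek(lineIndex * RecordLength, SeekOrigin.Begin);
                var buffer = new byte[LineLength];
                int read = fs.Read(buffer, 0, LineLength);
                return Encoding.ASCII.GetString(buffer, 0, read);
            }
        }
    }
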
ipavlu
  • I think you have to use Array.LongLength which is of Int64 type, instead of Array.Length since it is only Int32 and is obviously too small for input of 6 mln items – ironstone13 Mar 25 '17 at 20:40
  • @ironstone The maximum value for Int32 is around 2 billion (more exactly 2147483647), which is clearly bigger than 6 million (6000000). – vyrp Mar 25 '17 at 21:05
  • @ipavlu You say your code is optimized for performance, but you do realize that `arr = File.ReadAllLines("file.txt")` is storing all the text of the file in memory, right? – vyrp Mar 25 '17 at 21:07
  • @vyrp, true, my inability to calculate 2^32 / 2 - 1 clearly indicates it's time to go to bed – ironstone13 Mar 25 '17 at 21:08
  • @vyrp Of course I do :). I was just playing with the idea of remembering only the hash code and line index and, when necessary (when two hash codes are the same), going back to the file. With the given rigid file structure it would be easy to seek to the requested line index. The algorithm I provided can be extended in that direction, and it would limit the memory consumption considerably, but the time consumption would be worse. – ipavlu Mar 25 '17 at 23:35
-5

Before you write your data, if your data is in a list or dictionary, you could run a LINQ query and use GroupBy to group all like keys, then write each group to the output file.
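
A rough sketch of what this suggests (file names are placeholders): group identical lines and write one representative per group. Note that GroupBy buffers all the groups, so it does not use less memory than Distinct().

    using System.IO;
    using System.Linq;

    // Groups equal lines together and keeps one representative (the key) per group.
    var uniqueLines = File.ReadAllLines("source.txt")
        .GroupBy(line => line)
        .Select(group => group.Key);

    File.WriteAllLines("target.txt", uniqueLines);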

Your question is a little vague as well. Are you creating a new text file every time, and do you have to store the data as text? There are better formats to use, such as XML and JSON.

chris clifton
  • Performing more operations, especially LINQ operations, can very quickly lead to more processing time and more RAM use, and with grouping that's certain. The grouping is not necessary here... – ipavlu Mar 25 '17 at 13:29
  • Why is this the accepted answer? Using LINQ's GroupBy is not performant in either time or space. The other answers using HashSet are better. – vyrp Mar 25 '17 at 21:10
  • @Ali Tor, this suggestion is no better than your original code – ironstone13 Mar 25 '17 at 21:14
  • Chris, please note that the original question is using the .Distinct() method. That is LINQ! :) And actually no: XML/JSON are not better formats for memory or processing-time performance; they were developed for easy exchange, where a developer can read simple plain text easily. Hence the more recent development of binary formats like Google Protocol Buffers, for when you are serious about memory and processing-time performance... – ipavlu Mar 25 '17 at 23:43
  • Noted. Personally I have never tried to work with that amount of data in a text file; I have always had some sort of database. I was thinking that if he had this info in memory, he could use LINQ to remove the duplicates. – chris clifton Mar 26 '17 at 00:43