I am deduplicating a large collection of strings in Python and am tackling the problem with sets.Set(). The input is a set of strings read from a text file, and the output is the same strings with the duplicates removed.
The script needs to run on a machine with limited main memory (around 2 GB), and the problem is that the set grows too large; my input is an 800 MB text file.
Part of my code:
def Deduplicate(InputFile):
    StringSet = set()          # holds every distinct line seen so far
    for String in InputFile:
        StringSet.add(String)
    return StringSet
Is there a more memory-efficient way around this problem? I've considered a bloom filter and a trie, but I'd prefer to keep the O(1) lookups of set(). A rough sketch of the bloom-filter idea I was considering is below.
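For reference, this is roughly what I had in mind for the bloom-filter variant. It is only a sketch: the bit-array size and hash count are arbitrary guesses, the file names are placeholders, and a bloom filter can report false positives, so some genuinely unique lines could be dropped.

import hashlib

class BloomFilter(object):
    """Minimal bloom-filter sketch; size_bits and num_hashes are guesses."""

    def __init__(self, size_bits=2 ** 30, num_hashes=7):
        self.size = size_bits                    # number of bits in the filter
        self.num_hashes = num_hashes             # probes per string
        self.bits = bytearray(size_bits // 8)    # 2**30 bits = 128 MB

    def _positions(self, item):
        # Derive num_hashes bit positions from one MD5 digest (double hashing).
        digest = hashlib.md5(item.encode("utf-8")).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Streaming dedup with the filter ("input.txt"/"output.txt" are placeholders):
seen = BloomFilter()
with open("input.txt") as InFile, open("output.txt", "w") as OutFile:
    for String in InFile:
        if String not in seen:
            seen.add(String)
            OutFile.write(String)

The memory use is fixed by size_bits rather than by the number of distinct strings, which is the appeal, but the possibility of false positives is why I'd rather stay with an exact set() if I can.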
Edit: I've switched from sets.Set() to the built-in set(), which is supposed to be more memory-efficient, but it is still not enough.