1

I am working on a problem of deduplicating a large set of strings in Python and am tackling it with sets.Set(). The input is a set of strings read from a text file, and the output is the same set of strings with duplicates removed.

The script needs to run on a machine with limited main memory (around 2 GB), and the problem is that the set grows too large; my input is an 800 MB text file.

Part of my code:

for String in InputFile:
    StringSet.add(String)

return StringSet

Is there a more efficient way around this problem? I've considered a Bloom filter and a trie, but I'd prefer the O(1) lookups of Set().
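
For context, the sort of hand-rolled Bloom filter I was considering would look roughly like the sketch below (Python 2, like the rest of this question; the md5-based hashing scheme and the sizing are made up for illustration, and it can report false positives):

import hashlib
import struct

class BloomFilter(object):
    # Minimal Bloom filter sketch: fixed memory footprint, no false
    # negatives, but a tunable chance of false positives.
    def __init__(self, size_in_bits, num_hashes):
        self.size = size_in_bits
        self.num_hashes = num_hashes                  # at most 4 with this md5-based scheme
        self.bits = bytearray(size_in_bits // 8 + 1)

    def _positions(self, item):
        # Derive up to four bit positions from one 16-byte md5 digest.
        digest = hashlib.md5(item).digest()
        for i in range(self.num_hashes):
            value = struct.unpack('>I', digest[4 * i:4 * i + 4])[0]
            yield value % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# e.g. bloom = BloomFilter(8 * 100 * 1024 * 1024, 4)  # ~100 MB of bits, 4 positions per string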

Edit: I've switched from sets.Set() to set(), which is supposed to be more memory efficient, but it is still not efficient enough.

gamerx
  • How do you know it's running out of memory because of the set size - what actually happens? Is that exactly your code? File iteration is a generator already so there's not much left you can change; are you keeping 'String' around too long and it isn't being collected? Are you open to trying third party extensions e.g. https://pypi.python.org/pypi/blist/ ? (not sure if relevant, isn't O(1), but is an example). – TessellatingHeckler Dec 05 '14 at 06:26
  • The above is a greatly simplified version of my code. I know I'm running out of memory because a MemoryError is thrown for StringSet on a machine with less than 4GB of main memory but things work fine on my 16GB desktop. – gamerx Dec 05 '14 at 06:31
  • What is wrong with [set](https://docs.python.org/2/library/stdtypes.html#set)? Why are you going for [sets.Set](https://docs.python.org/2/library/sets.html#sets.Set) (which is deprecated btw)? – smac89 Dec 05 '14 at 06:31
  • @Smac89 Ahhh....didn't notice that sets.Set() is deprecated. Just read about it in the Python documentation and apparently set() has a more memory efficient pickle. Shall try it out and see if it solves my problem. Thanks – gamerx Dec 05 '14 at 06:36
  • @gamerx have you seen https://pythonhosted.org/Pympler/ for investigating memory use of Python objects? I haven't used it, but maybe it can help you track down which things are using the most memory and see if anything surprising is also taking up space. – TessellatingHeckler Dec 05 '14 at 06:40
  • Depending what you want to use the `set` for, you could always put it in an (on-disk) key store DB - that way you're moving the overhead onto disk usage (and supporting data). It'll be more costly to build initially, but if you're going to re-use it, it'll be persistent, so less costly to do again. It'll be slower than an in-memory `set`, but you'll have membership testing and be able to loop over keys etc... it just depends what problem you're trying to solve (a minimal sketch of this idea follows below). – Jon Clements Dec 05 '14 at 09:35
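
A minimal sketch of the on-disk key-store idea from the last comment, using sqlite3 from the standard library as a stand-in for a key store (file names are hypothetical, Python 2 style as in the rest of the thread); keying on a SHA-1 digest of each line keeps the on-disk table small and means only one line is held in memory at a time:

import hashlib
import sqlite3

db = sqlite3.connect('seen_lines.db')                 # hypothetical on-disk store
db.execute('CREATE TABLE IF NOT EXISTS seen (digest TEXT PRIMARY KEY)')

with open('input.txt', 'r') as infile, open('output.txt', 'w') as outfile:
    for line in infile:
        digest = hashlib.sha1(line).hexdigest()       # fixed-size key for this line
        try:
            db.execute('INSERT INTO seen (digest) VALUES (?)', (digest,))
            outfile.write(line)                       # first occurrence: keep it
        except sqlite3.IntegrityError:
            pass                                      # digest already stored: duplicate line

db.commit()
db.close()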

1 Answer

4

Rather than storing the strings themselves to spot duplicates, you could use a two-file mechanism: read a line, calculate a hash of it, and if that hash is not already in your set of seen hashes, write the line to the output file and add the hash to the set. Since you only need to keep the table of hashes and a single line in memory at any one time, this is more memory efficient unless your lines are shorter than the hash length. You should also find that it is a lot faster, because comparing even 64-byte hash entries is much quicker than comparing all the characters of two lines.

Something like:

import hashlib

hasher = hashlib.sha1  # hashlib.md5, sha224, sha256, sha384 or sha512 would also work
hashset = set()        # You could also use a list here - speed vs size would be interesting
with open(input_name, 'r') as infile, open(output_name, 'w') as outfile:
    for line in infile:            # iterate lazily so only one line is in memory at a time
        h = hasher(line).digest()  # fixed-size hash of the whole line
        if h not in hashset:
            outfile.write(line)    # first time this line has been seen
            hashset.add(h)         # use append(h) if using a list

The only remaining question is balancing the hash type and length against the chance of two different lines producing the same digest (a collision). For reference, the digest sizes in bytes and the claimed odds of a collision are:

Hash     Bytes   Collision claim
md5      16      1 : 2^64
sha1     20      1 : 2^80
sha224   28      1 : 2^112
sha256   32      1 : 2^128
sha384   48      1 : 2^192
sha512   64      1 : 2^256
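
As a rough sanity check you can estimate the collision risk with the usual birthday-bound approximation (assuming the digest behaves like a uniformly random value; the figures below are illustrative, not exact):

import math

# Probability of at least one collision among n random b-bit digests,
# using the birthday approximation 1 - exp(-n*(n-1) / 2**(b+1)).
def collision_probability(n, digest_bits):
    return 1.0 - math.exp(-float(n) * (n - 1) / 2 ** (digest_bits + 1))

print(collision_probability(10 ** 7, 160))   # ~10 million lines with sha1: effectively zero
print(collision_probability(10 ** 7, 32))    # the same lines with a 32-bit hash: essentially certain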
Steve Barnes
  • You must be careful about hash collisions! Different strings may map to the same hash value, especially if you're using small hashes like md5. – Rob Dec 05 '14 at 08:03
  • @Rob - that is why I had the point about balancing the hash length v chance of duplication. I will add to this. – Steve Barnes Dec 05 '14 at 08:20
  • This does not work. Internally python uses hashtables to implement sets, and therefore the size of items in the set doesn't matter (your hash values will be hashed again by python before being stored in a set: A set is basically just a dict with dummy values) – tomas Nov 26 '15 at 11:14
  • @tomas - the point is that each line can potentially be hundreds or thousands of bytes, while the stored hash is a fixed length for each entry - the original line can be discarded once you have calculated the hash, whereas if you add the lines themselves to the set they will all still be in RAM (a quick size comparison is sketched after these comments). – Steve Barnes Nov 27 '15 at 06:37
  • @SteveBarnes You are right of course, sorry. What I was thinking was that python will hash those lines anyways when constructing the set (so it's either a hash of the lines, or a hash of the hash of the lines). But obviously this only accounts for the length of the hashtable. The real values are of course saved there as well, otherwise we wouldn't be able to retrieve them :p. stackoverflow will not let me remove the downvote unless the answer is edited, which is embarrassing for me. Maybe you can make a small change so I can remove it? apologies. – tomas Dec 08 '15 at 16:54
  • @tomas - I have edited to make the point a little clearer and added a point about expected speed-up. – Steve Barnes Dec 08 '15 at 18:17
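
To make the memory trade-off in that last exchange concrete, here is a throwaway Python 2 comparison using an arbitrary 1000-character line (exact sizes will vary by platform and Python version):

import hashlib
import sys

line = 'x' * 1000                      # stand-in for one long input line
digest = hashlib.sha1(line).digest()   # fixed 20-byte digest of that line

print(sys.getsizeof(line))             # roughly 1 kB held per line if the lines themselves are kept
print(sys.getsizeof(digest))           # a few tens of bytes per line if only the digests are kept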