I am processing a large dataset and comparing the similarity of the strings in it. If two (or more) strings are similar enough, they are grouped together under one key-value pair in a dictionary. Eventually, I will end up with:

{1 : [string1, string2, ...], 2: [string3, string4, ...] ...}

All the similar strings therefore end up grouped under numerical keys. However, because the dataset is large and takes a long time to process, I saved everything using pickle, dumping the processed dictionary into a file with a .txt extension.

However, when I try to retrieve the data from the pickle file, the dictionary comes back distorted: there is only one key-value pair, and all the strings end up in the same group:

{1: [all the strings]}

At first I thought there was something wrong with my code, but I ran a small portion of the dataset through it and everything worked. I saved my data using the standard pickle.dump function. I'm wondering whether there is a memory or size limit to pickle and whether that's the underlying problem.
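For context, this is the sort of small-scale round-trip check that works for me (the dictionary here is a synthetic stand-in with made-up strings and arbitrary sizes, not my real data):

import pickle
import random
import string

# synthetic stand-in: 1,000 groups of 50 random 10-character strings each
testdict = {}
for key in range(1000):
    testdict[key] = ["".join(random.choice(string.ascii_lowercase) for _ in range(10))
                     for _ in range(50)]

with open("roundtrip_test.pkl", "wb") as f:
    pickle.dump(testdict, f)

with open("roundtrip_test.pkl", "rb") as f:
    restored = pickle.load(f)

# prints True if the dump/load round trip is lossless
print(restored == testdict)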

EDIT:

I would love to provide a minimal example, but the code I use to compare strings is fairly complicated. This is the pickle code I used to save the dictionary:

with open("product30.txt", "wb") as f:
    pickle.dump(mydict, f)

And this is the code that I used to retrieve the Pickled data:

with open("product30.txt", "rb") as f:
    retrieveddict = pickle.load(f)

I'm certain that the comparison code is correct: I ran it on a few random portions of the dataset and it worked, and all the data is in the same format, basically a list of 5,000 physical addresses.
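Since rerunning the full comparison takes about 10 hours, the quickest sanity check I can do on the retrieved dictionary is a structural summary like this (assuming the keys are the integer group IDs shown above):

from __future__ import print_function
import pickle

with open("product30.txt", "rb") as f:
    retrieveddict = pickle.load(f)

# how many groups came back, and how the strings are distributed across them
print("number of keys:", len(retrieveddict))
print("total strings:", sum(len(v) for v in retrieveddict.values()))
print("largest group:", max(len(v) for v in retrieveddict.values()))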

  • You should file this as a python bug. – simonzack Aug 11 '14 at 15:12
  • I can't replicate this - a 1000-key dictionary of 1000-element lists of random 5-character strings is pickled and restored and compares equal. Could you provide a [minimal example](http://stackoverflow.com/help/mcve) that allows others to recreate the issue? – jonrsharpe Aug 11 '14 at 15:13
  • There is a similar question here with an accepted answer: http://stackoverflow.com/questions/2108293/saving-huge-bigram-dictionary-to-file-using-pickle – Joe Smart Aug 11 '14 at 15:17
  • @jonrsharpe My dataset contains 5,000 physical addresses (building name, street number, street name, road, city, state, zipcode) and I'm comparing the similarity among the addresses. – qiaop Aug 11 '14 at 15:18
  • How big is the dumped file? – simonzack Aug 11 '14 at 15:18
  • @simonzack about 2.8mb – qiaop Aug 11 '14 at 15:19
  • Which version of Python? Have you tried `dict == retrieveddict` (with what results)? Why are you calling your own variable `dict`? 2.8MB doesn't seem like much - the data I mention above came to 17 (v2.7.6). – jonrsharpe Aug 11 '14 at 15:24
  • @jonrsharpe I actually didn't use "dict" as the name. I put it in here as an example. My Python version is 2.7. I couldn't use dict == retrieveddict because it takes 10 hours to run the comparison code so I just tested random portions of the dataset and everything worked out. – qiaop Aug 11 '14 at 15:56
  • I don't think this question is a dupe. The linked question above does not provide an answer to "Is there a size or memory limit to pickle." Instead, it provides some hand-wavy, best practice answer. – Max Aug 11 '14 at 16:08
  • This question has strong overlap with: http://stackoverflow.com/questions/25128560/python-pickle-dumping-a-very-huge-list and http://stackoverflow.com/questions/3957765/loading-a-large-dictionary-using-python-pickle. Not exactly a dupe, but a near dupe. Other questions ask how to serialize big dicts or lists… this asks "is there a memory limit". – Mike McKerns Aug 11 '14 at 16:19
  • although a solution for the problem (as noted in the links above) is the same. If you have a large dict, you should map the dict so each key-value pair corresponds to its own serialized file (or an entry in a database table); see the sketch after these comments. – Mike McKerns Aug 11 '14 at 16:23
  • and yes, pickle can choke on a large object if it runs out of memory. – Mike McKerns Aug 11 '14 at 16:39
  • I think the question "Is there a size or memory limit to Pickle?" is a different one than "How to deal with large pickles", which is what the linked duplicate addresses. – Michael Dorner Apr 30 '19 at 07:56
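For reference, a minimal sketch of the per-entry approach Mike McKerns suggests above: each key-value pair goes into its own pickle file, so no single dump has to hold the whole dictionary (the directory layout and file naming here are hypothetical):

import os
import pickle

def dump_per_key(d, directory):
    # write each key-value pair to its own pickle file instead of one big dump
    if not os.path.exists(directory):
        os.makedirs(directory)
    for key, value in d.items():
        with open(os.path.join(directory, "group_%d.pkl" % key), "wb") as f:
            pickle.dump(value, f)

def load_per_key(directory):
    # rebuild the dictionary by loading the per-key files one at a time
    result = {}
    for name in os.listdir(directory):
        if name.startswith("group_") and name.endswith(".pkl"):
            key = int(name[len("group_"):-len(".pkl")])
            with open(os.path.join(directory, name), "rb") as f:
                result[key] = pickle.load(f)
    return result

This keeps each individual pickle small, so a single oversized or corrupted dump cannot take the whole dictionary with it.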

0 Answers