I am processing a large dataset and comparing the strings in it for similarity. If two (or more) strings are similar enough, they are grouped together under one key in a dictionary. Eventually, I end up with:
{1 : [string1, string2, ...], 2: [string3, string4, ...] ...}
Therefore, all the similar strings are grouped into lists under numerical keys. Since the dataset is large and takes a long time to process, I saved the result using Pickle, dumping the processed dictionary into a file with a .txt extension.
However, when I tried to retrieve the data from the Pickle file, the dictionary was distorted: there was only one key-value pair, and all the strings were in the same group:
{1: [all the strings]}
At first I thought there was something wrong with my code, but I ran a small portion of the dataset and it worked. I saved my data using the standard pickle.dump function. Is there a memory or size limit in Pickle, and could that be the underlying problem?
EDIT:
I would love to provide a minimal example, but the code that I used to compare strings is a bit complicated. This is the Pickle code that I used to save the data:
import pickle
with open("product30.txt", "wb") as f:
    pickle.dump(mydict, f)
And this is the code that I used to retrieve the Pickled data:
with open("product30.txt", "rb") as f:
    retrieveddict = pickle.load(f)
I'm certain that the comparison code is correct. I ran it on a few random portions of the dataset and it worked, and all the data in my dataset is in the same format: basically a list of 5,000 physical addresses.
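To rule out Pickle itself, a round-trip check along the following lines could be run. This is only a minimal sketch: the `roundtrip` helper, the made-up addresses, the grouping into fours and the file name `roundtrip_check.pkl` are assumptions for illustration, not my real data or comparison code.

```python
import os
import pickle
import tempfile


def roundtrip(obj, path):
    """Pickle obj to path, load it back, and return the loaded copy."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for mydict: 5,000 made-up addresses,
# split into 1,250 groups of 4 under numerical keys starting at 1.
addresses = [f"{i} Example Street" for i in range(5000)]
sample = {k + 1: addresses[k * 4:(k + 1) * 4] for k in range(1250)}

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    loaded = roundtrip(sample, os.path.join(tmp, "roundtrip_check.pkl"))

# Compare the number of groups before and after, then the full contents.
print(len(sample), len(loaded))
print(loaded == sample)
```

If the key counts match here, it would suggest that the grouping had already collapsed in `mydict` before `pickle.dump` was called, which could be checked by printing `len(mydict)` right before dumping.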