2

So I saved a bunch of features as a .pkl file. This is the code I used to save the files initially.

with open('variables.pkl', 'wb') as output:
    pickle.dump(embedding_weights, output, 2)
    pickle.dump(X1, output, 2)
    pickle.dump(X2, output, 2)
    pickle.dump(Y, output, 2)
    pickle.dump(X1_test, output, 2)
    pickle.dump(X2_test, output, 2)
    pickle.dump(Y_test, output, 2)
    pickle.dump(X1_nli, output, 2)
    pickle.dump(X2_nli, output, 2)
    pickle.dump(Y_nli, output, 2)
    pickle.dump(X1_test_nli, output, 2)
    pickle.dump(X2_test_nli, output, 2)
    pickle.dump(Y_test_nli, output, 2)
    pickle.dump(X1_test_matched, output, 2)
    pickle.dump(X2_test_matched, output, 2)
    pickle.dump(Y_test_matched, output, 2)
    pickle.dump(X1_test_mismatched, output, 2)
    pickle.dump(X2_test_mismatched, output, 2)
    pickle.dump(Y_test_mismatched, output, 2)
    pickle.dump(X2_two_sentences, output, 2)
    pickle.dump(X2_test_two_sentences, output, 2)
    pickle.dump(tokenizer, output, 2)

NOTE: I received this data as-is, and this was the code used to produce it. I cannot rerun the above code because these are deep learning features that take hours to compute, so I won't be able to regenerate variables.pkl. The file size was approximately 1.93GB. After this, I wanted to update the X1_test file and X2_test file using the following code:

with open('variables.pkl', 'wb') as output:
     pickle.dump(X1_test, output, 2)
     pickle.dump(X2_test, output, 2)

My understanding was that it would just update the two files. Instead it has deleted all the files and only these two files remain. The file size is only 12.6KB now. Was wondering what I did wrong? How can I just update the said two files while keeping everything else the same.

Shawn
  • Possible duplicate of [Updating Python Pickle Object](https://stackoverflow.com/questions/36796322/updating-python-pickle-object) – rassar Oct 30 '19 at 11:39
  • Possible duplicate of [Pickle dump replaces current file data](https://stackoverflow.com/questions/20624682/pickle-dump-replaces-current-file-data) – Matthias Oct 30 '19 at 11:40

3 Answers

3

Instead of dumping all variables directly to the file one after the other, construct a dictionary whose keys are the variable names and whose values are the actual data.

Saving initially:

import pickle

with open('variables.pkl', 'wb') as output:
    d = {'X1': X1, 'X2': X2}
    pickle.dump(d, output)  # dump() writes to a file; dumps() only returns bytes

Updating these values:

#Open existing
with open('variables.pkl', 'rb') as file:
    d = pickle.load(file)

#Update the values
d['X1'] = X1
d['X2'] = X2

#Save
with open('variables.pkl', 'wb') as file:
    pickle.dump(d, file)  # dump(), not dumps(), to write into the file
Rithin Chalumuri
  • Thanks for the answer! Please check the edit. I cannot modify the ```variables.pkl``` file. Hence, I want to update the values in ```variables.pkl``` but cannot recompute the entire file – Shawn Oct 30 '19 at 13:40
3

A pickle is a byte sequence that you can write to a file. You can write multiple pickles one after the other into the same file, but in general you cannot delete or replace the first pickle, or any other pickle in the middle of the file, in place.

1.) You opened the file in mode "wb", which truncates the file, so all existing data is lost. But imagine you had opened it differently, with mode "r+" for example.

Say the first pickle takes 300 bytes; the second pickle then begins at offset 300.

Now imagine you open the file in read/write mode and write a new pickle at file offset 0. Unfortunately, your new pickle now takes 301 bytes, so you have overwritten the first byte of the second pickled object. The second pickle thus becomes unreadable, and it would even be quite tricky to find the position of the beginning of the third pickle object.
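To make the size problem concrete, here is a minimal sketch. The three small objects are stand-ins I made up (not the real features); the point is only that a changed object usually pickles to a different number of bytes, so it no longer fits in the slot of the old one.

```python
import pickle

# Stand-in objects; the real file holds large feature arrays.
old_first = pickle.dumps([1, 2, 3], 2)
second = pickle.dumps("second object", 2)
new_first = pickle.dumps([1, 2, 3, 4], 2)

# In a file of back-to-back pickles, the second one starts where the first ends.
second_offset = len(old_first)

# Bytes of the second pickle that an in-place overwrite at offset 0 would clobber.
overrun = len(new_first) - len(old_first)

print("second pickle starts at offset", second_offset)
print("new first pickle is", overrun, "byte(s) too long for the old slot")
```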

What you can do instead: either perform a full read / modify / write,

meaning you read back all pickles into temporary variables, modify the ones you want to change, and write everything back.
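A full read / modify / write could be sketched roughly as follows. The helper names `load_all` and `save_all` are my own (hypothetical), and the sketch assumes you know how many objects were dumped and in which order, and that everything fits in memory at once.

```python
import os
import pickle


def load_all(path, names):
    """Read one pickle per name, in order, and return them as a dict."""
    data = {}
    with open(path, "rb") as src:
        for name in names:
            data[name] = pickle.load(src)
    return data


def save_all(path, data, names, protocol=2):
    """Write the objects back in the same order, going through a temporary file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as dst:
        for name in names:
            pickle.dump(data[name], dst, protocol)
    # Swap the new file in only after it has been written completely.
    os.replace(tmp_path, path)
```

With `names` listing the variables in the order they were originally dumped (`'embedding_weights'`, `'X1'`, `'X2'`, `'Y'`, `'X1_test'`, ...), you would call `data = load_all('variables.pkl', names)`, assign the new values to `data['X1_test']` and `data['X2_test']`, and then call `save_all('variables.pkl', data, names)`.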

The other alternative is to use the shelve module ( https://docs.python.org/3.6/library/shelve.html ), which behaves like a dict of pickles.

Your code could look like:

import shelve

d = shelve.open("mydata.shlv")

d['embedding_weights'] = embedding_weights
d['X1'] = X1
d['X2'] = X2
d['Y'] = Y
d['X1_test'] = X1_test
d['X2_test'] = X2_test
d['Y_test'] = Y_test
d['X1_nli'] = X1_nli
d['X2_nli'] = X2_nli
d['Y_nli'] = Y_nli
d['X1_test_nli'] = X1_test_nli
d['X2_test_nli'] = X2_test_nli
d['Y_test_nli'] = Y_test_nli
d['X1_test_matched'] = X1_test_matched
d['X2_test_matched'] = X2_test_matched
d['Y_test_matched'] = Y_test_matched
d['X1_test_mismatched'] = X1_test_mismatched
d['X2_test_mismatched'] = X2_test_mismatched
d['Y_test_mismatched'] = Y_test_mismatched
d['X2_two_sentences'] = X2_two_sentences
d['X2_test_two_sentences'] = X2_test_two_sentences
d['tokenizer'] = tokenizer
d.close()
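Once the data lives in a shelf, updating a single entry later leaves the other entries untouched. A small sketch with placeholder lists instead of the real features (the temporary directory and the values are only for illustration):

```python
import os
import shelve
import tempfile

# Illustrative location; in practice this would be your own "mydata.shlv".
path = os.path.join(tempfile.mkdtemp(), "mydata.shlv")

# First run: store several values under their names.
with shelve.open(path) as d:
    d["X1_test"] = [1, 2, 3]
    d["X2_test"] = [4, 5, 6]
    d["Y_test"] = [0, 1, 0]

# Later run: replace one value; the other keys are left alone.
with shelve.open(path) as d:
    d["X1_test"] = [7, 8, 9]

# Read everything back to check what is stored now.
with shelve.open(path) as d:
    result = dict(d)

print(result)
```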

However, if you must stick to the current file format and you have enough free disk space, you can do something like:

with open('variables.pkl', 'rb') as input:
  with open("newvariables.pkl", "wb") as output:
    pickle.dump(pickle.load(input), output, 2) # embedding_weights
    pickle.dump(pickle.load(input), output, 2) # X1 
    pickle.dump(pickle.load(input), output, 2) # X2
    pickle.dump(pickle.load(input), output, 2) # Y
    pickle.load(input) # read and discard X1_test
    pickle.dump(new_X1_test, output, 2)
    pickle.load(input) # read and discard X2_test
    pickle.dump(new_X2_test, output, 2)
    pickle.dump(pickle.load(input), output, 2) # Y_test ...
    pickle.dump(pickle.load(input), output, 2) # X1_nli
    ...
# now you could rename variables.pkl to oldvariables.pkl
# and rename newvariables.pkl to variables.pkl
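The same copy can be written as a loop, so you do not have to spell out every load/dump pair. This is only a sketch: `copy_with_replacements` is a name I made up, and it assumes you identify the objects to swap by their position in the file (in the question's dump order, X1_test is at index 4 and X2_test at index 5).

```python
import pickle


def copy_with_replacements(src_path, dst_path, replacements, protocol=2):
    """Copy pickles from src to dst, swapping in replacements by position.

    replacements maps a zero-based index to the new object for that slot.
    Returns the number of objects written.
    """
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            try:
                obj = pickle.load(src)  # only one object is in memory at a time
            except EOFError:
                break  # no more pickles in the source file
            if count in replacements:
                obj = replacements[count]
            pickle.dump(obj, dst, protocol)
            count += 1
    return count
```

For the file in the question this would be called as `copy_with_replacements('variables.pkl', 'newvariables.pkl', {4: new_X1_test, 5: new_X2_test})`, followed by the renames described above.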
gelonida
  • Thanks for the answer! Please check the edit. I cannot modify the ```variables.pkl``` file. Hence, I want to update the values in ```variables.pkl``` but cannot recompute the entire file – Shawn Oct 30 '19 at 13:41
  • You cannot just update some bytes in the middle of this pickle file, as the new pickles do not have the same size. I will enhance my answer into a version, where you create a new copy of the pickle file with the adequate changes. With the given file format you cannot do better. – gelonida Oct 30 '19 at 15:50
  • does the second part of my solution help? – gelonida Nov 04 '19 at 01:23
0

The pickle file format is a sequential format. Thus, if you change one item, at least everything behind that position in the file has to be rewritten. The wb mode opens a file for writing only, in binary format. It truncates the file if it already exists and creates a new file for writing if it does not.

Because the file is a binary stream of pickles, I don't see a way to update an item in place, though appending is still possible: check the ab mode.
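For illustration, appending with the ab mode looks roughly like this. The strings and the temporary path are placeholders; note that appending only adds a new pickle at the end and does not replace anything already in the file.

```python
import os
import pickle
import tempfile

# Illustrative path; the strings stand in for real objects.
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")

with open(path, "wb") as f:  # creates (or truncates) the file
    pickle.dump("first", f, 2)

with open(path, "ab") as f:  # keeps existing bytes, writes at the end
    pickle.dump("appended", f, 2)

# Objects come back in the order they were written.
with open(path, "rb") as f:
    loaded = [pickle.load(f), pickle.load(f)]

print(loaded)
```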

To resolve your issue, the right way would be to open the file in read mode first, unpickle its content into a Python variable, update that variable, and then open the file in wb mode to overwrite it. The code below does this.

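# Assumes variables.pkl holds a single pickled dict (unlike the file in the question)
with open('variables.pkl', 'rb') as file:
    variable_dict = pickle.load(file)

variable_dict['X1_test'] = X1_test
variable_dict['X2_test'] = X2_test

with open('variables.pkl', 'wb') as file:
    pickle.dump(variable_dict, file)  # dump(), not dumps(), to write into the file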
Anant Mittal