
If I have a fairly large data structure, such as a list or a dictionary, that I load from a pickle file into Python and then modify only one or two records, can I update just those records in the file, or do I have to write the entire data structure back? The idea is to avoid excessive and unnecessary hard-drive activity, especially writing.

If I can't do that, I guess I need to upgrade to a database?

UPDATE: I tried @Pynchia's recommendation to use the shelve module, and it does the job of storing and modifying the data. I only need to confirm that when I modify a single phone-number field, only that one field, or at most that one record, is written to disk, and not the whole dataset. Is that the case? That is the question.

import shelve

# create three sample records keyed by '0', '1', '2'
s = shelve.open('test.dat')
for i in range(3):
    record = {'name': 'ABC' + str(i),
              'phone': (str(i) * 3) + '-' + (str(i) * 4),
              'addr': (str(i) * 3) + ' Main St'}
    s[str(i)] = record
s.close()  # note the (): a bare s.close never actually calls the method

# read the records back
s = shelve.open('test.dat')
for i in range(3):
    print(s[str(i)])
s.close()

# update one field of one record: mutate a copy, then reassign the key
s = shelve.open('test.dat')
temp = s['1']
temp['phone'] = '1-800-GET-PYTHON'
s['1'] = temp
s.close()

print()
s = shelve.open('test.dat')
for i in range(3):
    print(s[str(i)])
s.close()

Output:

{'name': 'ABC0', 'addr': '000 Main St', 'phone': '000-0000'}
{'name': 'ABC1', 'addr': '111 Main St', 'phone': '111-1111'}
{'name': 'ABC2', 'addr': '222 Main St', 'phone': '222-2222'}

{'name': 'ABC0', 'addr': '000 Main St', 'phone': '000-0000'}
{'phone': '1-800-GET-PYTHON', 'addr': '111 Main St', 'name': 'ABC1'}
{'name': 'ABC2', 'addr': '222 Main St', 'phone': '222-2222'}
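
One way to peek at where the data actually lives (a sketch, assuming the `test.dat` shelf created above and the standard `dbm` backend behind `shelve`): each value is pickled separately and stored under its own key in a `dbm` database, so the entries can be inspected individually:

import dbm

# each shelf value is its own pickle, stored under its key;
# reassigning s['1'] should therefore rewrite only that entry
db = dbm.open('test.dat', 'r')
for key in db.keys():
    print(key, db[key][:20])
db.close()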
mcu
  • Please provide a concrete example. Have you had a look at [shelve](https://pymotw.com/2/shelve/)? Maybe it could help – Pynchia Oct 25 '15 at 20:31
  • @Pynchia Please see my updated post. – mcu Oct 25 '15 at 21:11
  • thank you for the data. Shelve is built on top of `pickle` so I don't think it would be able to satisfy your requirement. I am not a pickle/shelve expert, let's wait for the gurus. I have suggested it because it wasn't clear (to me) if you wanted to update a whole record in a sequence of records or a field/attribute only within a single record. – Pynchia Oct 25 '15 at 21:23
  • It is ok to update a whole record, rather than a field, so long as it does not write the whole dataset back to disk. Does the `shelve` documentation state anywhere that it does so? – mcu Oct 25 '15 at 21:26
  • BTW, if you plan to move to a DB, consider [MongoDB](https://www.mongodb.org/). It's a noSQL `document-oriented` DBMS, excellent for storing data structures in one go – Pynchia Oct 25 '15 at 21:27
  • have you read [this SO QA](http://stackoverflow.com/questions/14668475/pickle-versus-shelve-storing-large-dictionaries-in-python) – Pynchia Oct 25 '15 at 21:29
  • My thinking is this: if `shelve` lets you load individual records from a dataset without loading the whole dataset into memory, then it is reasonable to think it could not possibly write the entire dataset back to disk, because it never has it all loaded. So I believe the `shelve` module will work here. – mcu Oct 25 '15 at 21:43

1 Answer


The pickle file format is sequential: an object is serialized as one continuous stream. Thus, if you change one item, at least everything after that position in the file has to be rewritten.
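
As a rough sketch (the file name is hypothetical), the usual pickle workflow therefore looks like this, with the complete structure re-serialized on every save:

import pickle

# load everything, modify one record, write everything back;
# pickle offers no way to overwrite just the changed bytes in place
with open('contacts.pkl', 'rb') as f:
    data = pickle.load(f)

data['1']['phone'] = '1-800-GET-PYTHON'

with open('contacts.pkl', 'wb') as f:
    pickle.dump(data, f)  # rewrites the entire file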

Unfortunately, I don't know of, and can't imagine, any way in which updating a single item in place could work with pickle.

Depending on the structure of your data, I see two possibilities:

  1. data that can be represented as rows, with only a small amount of data per field => use a database like sqlite (there are many other databases, some document-oriented, some that act like a dictionary); see the sketch after this list
  2. few, large datasets => use an HDF5 container file. HDF5 is meant for storing large datasets and accessing only the necessary parts
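
A minimal sqlite sketch for the phone-book records above (the table and column names are my own illustration, nothing prescribed):

import sqlite3

conn = sqlite3.connect('test.db')
conn.execute('CREATE TABLE IF NOT EXISTS contacts '
             '(id TEXT PRIMARY KEY, name TEXT, phone TEXT, addr TEXT)')
for i in range(3):
    conn.execute('INSERT OR REPLACE INTO contacts VALUES (?, ?, ?, ?)',
                 (str(i), 'ABC' + str(i),
                  str(i) * 3 + '-' + str(i) * 4,
                  str(i) * 3 + ' Main St'))
conn.commit()

# updating one field touches only the pages holding that row,
# not the whole database file
conn.execute("UPDATE contacts SET phone = '1-800-GET-PYTHON' WHERE id = '1'")
conn.commit()
print(conn.execute('SELECT * FROM contacts ORDER BY id').fetchall())
conn.close()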
Sven Rusch
  • I will look into the HDF5 file when I have more time. Thank you for the tip. – mcu Oct 25 '15 at 21:29
  • Now that you added your example dataset, I don't think HDF5 is right for the job. It's meant for really large numerical datasets. A relational database like sqlite is the perfect solution for your dataset as long as the number of fields is fixed. Otherwise I would suggest MongoDB which has really good Python bindings. The advantage of sqlite however is that it is part of the Python standard library and does not require a server. – Sven Rusch Oct 25 '15 at 21:45