PyYAML file efficient management

Question

I am writing a Python program that maintains a list of contacts, each having 3 fields:

Name
Phone Number
Email

Contacts need to be saved in a YAML structured file and the program is supposed to provide the facility of adding new contacts.

My Code for this is:

import os
import yaml

class contacts:
    def add_contact(self,file,contact):
        if not os.path.exists(file):
            #Creating for the first time
            temp = []
            temp.append(contact)
            with open(file, "w") as file_desc:
                yaml.dump(temp, file_desc, default_flow_style=False)
            file_desc.close()
        else:
            #Second onwards
            with open(file, "r") as file_desc:
                loaded = yaml.safe_load(file_desc)
                loaded.append(contact)
                with open(file, "w") as file_desc2:
                    yaml.dump(loaded, file_desc2, default_flow_style=False)
                    file_desc2.close()
            file_desc.close()

if __name__ == "__main__":

    data1 = {'name' :'Abcd', 'phone': 1234, 'email': 'abcd@gmail.com'}
    data2 = {'name': 'efgh', 'phone': 5678, 'email': 'efgh@gmail.com'}
    contact = contacts()
    contact.add_contact("contacts.yaml", data1)
    contact.add_contact("contacts.yaml",data2)

I think this is an inefficient implementation. If we have 1 million contacts and want to add a new one, this will first read all of them, append one to the list, and then write all 1 million + 1 contacts again. Is there a way to add just the new contact without rewriting the whole file? I guess reading is important, as I don't want to store duplicate contacts and that would require comparison. Any other efficient approach would also be appreciated.

YAML really isn't meant for data storage. Can you not use a sqlite3 DB instead? Then you can have suitable indices for your columns and any appropriate unique constraints to prevent inserting duplicates, you can update/delete existing rows easily, and adding new contacts won't have such a massive overhead. — Jon Clements, Sep 16 '18 at 08:42
Just a technical note: There is no need to call `file_desc.close()`. The whole point of using `open()` with the `with` statement is that it closes the file automatically after exiting the context of the `with` block. — Iguananaut, Sep 16 '18 at 08:47
I was going to make the same comment. You didn't say anything about why you have the requirement that "Contacts need to be saved in a YAML structured file". YAML is a serialization and exchange format; it shouldn't be used for large collections of records. Depending on the size of the record, it's not the right technology for that purpose even if you only have to store hundreds... — Iguananaut, Sep 16 '18 at 08:49
...that said, from your example code I see no reason you need to read in the entire file just to append a record. Since your top-level data structure is a list, a nice thing about the YAML format is you can easily append another list item (at least block-level lists, as opposed to inline lists that use the bracketed `[...]` notation). So if you open the file in append mode (`open(..., 'a')`) and write a single-element list to the end of the file, it should preserve the list structure. — Iguananaut, Sep 16 '18 at 08:53
Better still, YAML has a notion of "documents" and it's possible to write multiple documents to a single file. For something where each "contact" is an individual record it doesn't necessarily make sense for each one to be stored in a YAML list. Rather, each one can be stored in a single document. Those documents can all go in a single file or, often better, in separate files named by some unique key that lets you look up a contact by that key. But really you're better off using a NoSQL database. — Iguananaut, Sep 16 '18 at 08:56
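The two append-only ideas from the comments above can be sketched with PyYAML roughly as follows. This is a minimal sketch, not code from the commenters: the function names `append_contact` and `append_contact_document` are hypothetical, and it assumes the existing file (if any) is a block-style top-level list (or a stream of documents, for the second function) that ends with a newline. Neither function checks for duplicates, since that would still require reading the file.

```python
import yaml


def append_contact(path, contact):
    """Append one contact as a block-style list item without reading the file.

    Hypothetical helper. Dumping a one-element list produces a single
    "- key: value" item, which continues an existing top-level block sequence.
    """
    with open(path, "a") as fp:  # "a" also creates the file if it is missing
        yaml.safe_dump([contact], fp, default_flow_style=False)


def append_contact_document(path, contact):
    """Append one contact as its own YAML document, introduced by '---'.

    Hypothetical helper. Read the result back with yaml.safe_load_all().
    """
    with open(path, "a") as fp:
        yaml.safe_dump(contact, fp, default_flow_style=False, explicit_start=True)
```

A file written only with `append_contact` can still be read with `yaml.safe_load`, while a file written with `append_contact_document` needs `yaml.safe_load_all`, which yields one mapping per contact.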

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

In a long-running program/process P there is indeed no need to re-read the data. There are a few things to keep in mind:

1) If you only use the YAML document in other programs when P has stopped, then you only need to write out the file when P exits. You might want to do so using atexit, if you don't have a single exit point.
2) If other programs might edit/update the list while P is running, then make sure that you check the datetime stamp of the YAML file and re-read the file before adding a new contact. If necessary, you can work with locks to make sure only one program at a time updates the file.
3) If other programs need to have an up-to-date YAML document, you can either write the YAML out on each update, or you can use some mechanism to notify P that the YAML document needs to be written. I have used both SIGINT handling and zeromq-based communications to do so.
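The first two points above can be sketched with plain PyYAML and the standard library. This is only an illustration under stated assumptions: the class `ContactStore` and all of its attributes are hypothetical names, it does no file locking (so two processes writing at the same moment can still clash), and it treats two contacts as duplicates only if all their fields are equal.

```python
import atexit
import os

import yaml


class ContactStore:
    """Hypothetical helper: keeps contacts in memory and writes them on exit."""

    def __init__(self, path):
        self.path = path
        self.contacts = []
        self._pending = []   # contacts added in memory but not yet written
        self._mtime = None   # modification time of the file as last seen
        self._reload_if_changed()
        atexit.register(self.save)  # write once when the process exits

    def _reload_if_changed(self):
        # Re-read the file only if its modification time differs from last time.
        if not os.path.exists(self.path):
            return
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.path) as fp:
                on_disk = yaml.safe_load(fp) or []
            # Keep unsaved in-memory additions that are not already on disk.
            extra = [c for c in self._pending if c not in on_disk]
            self.contacts = on_disk + extra
            self._mtime = mtime

    def add(self, contact):
        """Add a contact; return False if an identical one is already stored."""
        self._reload_if_changed()
        if contact in self.contacts:
            return False
        self.contacts.append(contact)
        self._pending.append(contact)
        return True

    def save(self):
        """Write the full list, but only if something was added since last save."""
        if not self._pending:
            return
        with open(self.path, "w") as fp:
            yaml.safe_dump(self.contacts, fp, default_flow_style=False)
        self._mtime = os.stat(self.path).st_mtime_ns
        self._pending = []
```

With this shape, adding many contacts in one run costs one read at start-up and one write at exit, instead of a full read and write per contact.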

A lot of the above is done for you if you use a real database, and for a simple table of records that all have the same fields, that might be a better alternative. However, as soon as things get more complex (different fields per record, complex and possibly recursive data), a lot of (SQL) databases become an additional problem instead of helping solve the one you are trying to tackle.
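For comparison, the database alternative for this simple three-field table might look like the sketch below, using the standard-library sqlite3 module. The table layout, the function names `open_contacts_db` and `add_contact`, and the choice of the email column as the uniqueness key are all assumptions made for illustration; the question does not say what makes two contacts duplicates.

```python
import sqlite3


def open_contacts_db(path="contacts.db"):
    """Open (or create) the database; the schema here is an assumed example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts ("
        " name TEXT NOT NULL,"
        " phone TEXT NOT NULL,"
        " email TEXT NOT NULL UNIQUE)"  # assumption: email identifies a contact
    )
    return conn


def add_contact(conn, name, phone, email):
    """Insert one contact; return False if the email is already present."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO contacts (name, phone, email) VALUES (?, ?, ?)",
            (name, str(phone), email),
        )
    return cur.rowcount == 1
```

Here the duplicate check is done by the UNIQUE index, so adding a contact does not require reading or rewriting the existing rows.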

ruamel.yaml.base (disclaimer: I am the author of that package) does item 2) for you out-of-the-box; the other two items are easily implemented as well. The only tricky thing is that YAMLBase normally expects a mapping/dict at the root level for a new file, so some coercion needs to take place when the file doesn't exist yet.

After you do pip install ruamel.yaml.base:

import os
import ruamel.yaml
from ruamel.yaml.base import YAMLBase

yaml_path = 'contacts.yaml'

class Contacts(YAMLBase):
   def __init__(self, path=yaml_path, verbose=0):
       self._create_ok = True  # so the file is auto-created if it doesn't exist
       super().__init__(path=path, verbose=verbose)
       if not os.path.exists(path):
           # this is necessary to force block style sequence at the top
           self._data = ruamel.yaml.comments.CommentedSeq()
           self._changed = True

   def add_record(self, contact):
       self.data.append(contact)
       self._changed = True  # this signals that writing is necessary

   def dump_file(self):
       """dump the contents of the file on disc"""
       print('dumping: "{}"'.format(self._path))
       with open(self._path) as fp:
           print(fp.read(), end='')



data1 = {'name' :'Abcd', 'phone': 1234, 'email': 'abcd@gmail.com'}
data2 = {'name': 'efgh', 'phone': 5678, 'email': 'efgh@gmail.com'}

contacts = Contacts()
contacts.add_record(data1)
contacts.save()  # optional
contacts.dump_file()

# this is just for checking 

contacts.add_record(data2)
contacts.save()
contacts.dump_file()

which gives:

dumping: "contacts.yaml"
- name: Abcd
  phone: 1234
  email: abcd@gmail.com
dumping: "contacts.yaml"
- name: Abcd
  phone: 1234
  email: abcd@gmail.com
- name: efgh
  phone: 5678
  email: efgh@gmail.com

If you set the verbose parameter to 1, you'll get some information on stdout about what is going on in the package.

If you have a lot of records, then you might want to change self.data in Contacts to self.fast_data. This will then load the YAML using the much faster C-based loader, at the expense of not being able to preserve (hand-added) comments etc. in the input YAML. (In either case a "safe_load" is being used.)

I just realised the ruamel.yaml.base repository has not been pushed to Bitbucket yet; I'll try to remedy that soon. You can of course look at the source of the installed package. — Anthon, Sep 16 '18 at 10:17
