
This question is more about architecture and libraries than implementation.

I am currently working on a project which requires a local long-term cache storage (updated once a day) on the client, kept in sync with a remote database on the server. SQLite has been chosen for the client side as a lightweight approach, and PostgreSQL as the feature-rich database on the server. The native replication mechanisms of Postgres are not an option, because I need to keep the client really lightweight and free of dependencies on external components like database servers.

The implementation language will be Python. I'm currently looking at ORMs like SQLAlchemy, but I haven't worked with any before.

  • Does SQLAlchemy have any tools to keep SQLite and Postgres databases in sync?
  • If not, are there any other Python libraries which have such tools?
  • Any ideas about what the architecture should look like if the task must be solved "by hand"?

Added: It's like telemetry, because the client will have an internet connection only for approximately 20 minutes a day.

So the main question is about the architecture of such a system.

thodnev
  • This question is far too broad. – e4c5 Dec 23 '16 at 02:05
  • A side note. If there's a place for a trade-off at your table, you can look at CouchDB, which advertises itself as an offline-first database (document-oriented though, not relational). Or at least look for ideas in their docs about [eventual consistency](http://docs.couchdb.org/en/2.0.0/intro/consistency.html#eventual-consistency) (and possibly implement it over a relational database). In my experience it's too restrictive and inconvenient, but your mileage may vary. – saaj Dec 28 '16 at 16:09
  • Is this one-way replication or two-way replication, i.e. is the database on the client read-only? If only the database on the client is updated, then it is much simpler: no collisions. If both clients and the server make modifications to the database, can you partition the data per client? The reason I ask is to see if there are ways to avoid collisions where two clients update the same row offline and then both upload; how would you reconcile that? – Youn Elan Dec 28 '16 at 23:07
  • @YounElan Yes, the replication is two-way, so collisions are possible. Now I'm looking at solving this problem by designing my db as a table with a list of all transactions plus a materialized view that represents the totals. – thodnev Jan 01 '17 at 12:07

1 Answer


Syncing data between databases doesn't usually fall within the tasks of an ORM, so you will likely have to implement it yourself. I don't know of any solution that will handle the syncing for you given your choice of databases.

There are a couple of important design choices to consider:

  • how do you figure out what data changed (i.e. was inserted, updated or deleted)?
  • what is the most efficient way to package the change log?
  • will you have to deal with conflicts, and how will you do that?

The most efficient way to figure out what changed is to have the database tell you directly. Bottled Water can offer some inspiration in this regard: the idea is to tap into the event log Postgres would use for replication. You will need something like Kafka to keep track of what each of your clients already knows. This lets you optimize your server for writes, as you won't have clients running queries to figure out what changed since they were last online. The same can be achieved on the SQLite end with event callbacks or triggers; you'll just have to trade some storage space on the client to retain the changes to be sent to the server. If that sounds like too much infrastructure for your needs, it's something you can easily implement with plain SQL and polling as well, but I would still think of it as an event log and treat how it's implemented as a detail, possibly allowing for a more efficient implementation later on. A minimal sketch of the trigger-based variant on the client is shown below.
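For illustration, here is a minimal sketch of such a trigger-based change log on the SQLite side; the `readings` table and all column names are invented for the example, not something your schema has to follow:

```python
# Sketch: record every local modification in a changelog table via
# SQLite triggers, so the client knows what to send at the next sync.
import sqlite3

conn = sqlite3.connect("client_cache.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS readings (
    id          INTEGER PRIMARY KEY,
    value       REAL,
    recorded_at TEXT
);

-- every local change is appended here until the next sync window
CREATE TABLE IF NOT EXISTS changelog (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT NOT NULL,
    op         TEXT NOT NULL,   -- 'I', 'U' or 'D'
    row_id     INTEGER NOT NULL,
    changed_at TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER IF NOT EXISTS readings_ins AFTER INSERT ON readings
BEGIN
    INSERT INTO changelog (table_name, op, row_id)
    VALUES ('readings', 'I', NEW.id);
END;

CREATE TRIGGER IF NOT EXISTS readings_upd AFTER UPDATE ON readings
BEGIN
    INSERT INTO changelog (table_name, op, row_id)
    VALUES ('readings', 'U', NEW.id);
END;

CREATE TRIGGER IF NOT EXISTS readings_del AFTER DELETE ON readings
BEGIN
    INSERT INTO changelog (table_name, op, row_id)
    VALUES ('readings', 'D', OLD.id);
END;
""")
conn.commit()
```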

The best way to structure and package your change log will depend on your application's requirements, available bandwidth, etc. You could use a standard format such as JSON, compressed and encrypted if needed.
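For example, a minimal sketch of packaging pending entries from the changelog table sketched above as gzip-compressed JSON; `last_acked_seq` stands in for whatever sequence number the server last confirmed:

```python
import gzip
import json
import sqlite3

def package_changes(conn: sqlite3.Connection, last_acked_seq: int) -> bytes:
    """Serialize all changelog entries newer than last_acked_seq."""
    rows = conn.execute(
        "SELECT seq, table_name, op, row_id, changed_at "
        "FROM changelog WHERE seq > ? ORDER BY seq",
        (last_acked_seq,),
    ).fetchall()
    payload = [
        {"seq": s, "table": t, "op": op, "row_id": r, "at": at}
        for s, t, op, r, at in rows
    ]
    # compress; encryption could be layered on top of this if needed
    return gzip.compress(json.dumps(payload).encode("utf-8"))
```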

It will be much simpler to design your application to avoid conflicts in the first place: possibly have data flow in a single direction, or partition your data so that it always flows in a single direction for a specific partition.
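As an illustration of that partitioning idea (the table names are invented): telemetry-style data is only ever written on the client and pushed up, while configuration is only ever written on the server and pulled down, so neither direction can conflict with the other.

```python
# Hypothetical schema sketch: each table is owned by exactly one side,
# so every row flows in a single direction and conflicts cannot occur.
import sqlite3

SCHEMA = """
-- client-owned partition: written on the client, pushed to the server
CREATE TABLE IF NOT EXISTS telemetry (
    client_id TEXT NOT NULL,
    seq       INTEGER NOT NULL,
    payload   TEXT,
    PRIMARY KEY (client_id, seq)
);

-- server-owned partition: written on the server, pulled by the client
CREATE TABLE IF NOT EXISTS settings (
    key        TEXT PRIMARY KEY,
    value      TEXT,
    updated_at TEXT
);
"""

conn = sqlite3.connect("client_cache.db")
conn.executescript(SCHEMA)
conn.commit()
```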

One final thought: with such an architecture you would be getting incremental updates, some of which might be missed for unplanned reasons (system failure, bugs, dropped messages, etc.). You could build in some heuristic to check that your data matches, such as at least comparing the number of records on each side, with some way to recover from such a fault; at a minimum, a way to manually re-fetch the data from the authoritative source. I.e., if the server is authoritative, the client should be able to discard its data and re-fetch it. You might need such a mechanism anyway for cases when the client is reinstalled, etc.
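A minimal sketch of that record-count heuristic, assuming a sqlite3 connection on the client and any DB-API cursor (e.g. psycopg2) on the server; the table list and the `refetch_table` callback are placeholders:

```python
import sqlite3

REPLICATED_TABLES = ["readings"]  # placeholder: list every synced table

def counts_match(client_conn, server_cur, table):
    """Cheap consistency heuristic: compare row counts on both sides."""
    # table names come from the fixed list above, so f-strings are safe here
    client_n = client_conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    server_cur.execute(f"SELECT count(*) FROM {table}")
    server_n = server_cur.fetchone()[0]
    return client_n == server_n

def verify_or_recover(client_conn, server_cur, refetch_table):
    """On drift, discard the client copy of a table and re-fetch it
    from the authoritative server side."""
    for table in REPLICATED_TABLES:
        if not counts_match(client_conn, server_cur, table):
            refetch_table(table)
```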

Alpar