
I'm trying to develop a system that will allow users to update local, offline databases on their laptops and, upon reconnection to the network, synchronize those databases with the main (master) database.

I looked at MySQL replication, but that documentation focuses on unidirectional syncing. So I think I'm going to build a custom Python app for bidirectional syncing, and I have a couple of questions.

I've read a couple of posts regarding this issue, and one of the things mentioned in passing is serialization (which I would implement through the pickle and cPickle modules in Python). Could someone please tell me whether this is necessary, and what the advantages of serializing data are in the context of syncing databases?

Wikipedia's entry on serialization states that one of its uses is "a method for detecting changes in time-varying data." This sounds really important, because my application will be looking at timestamps to determine which records have precedence when updating the master database. So, I guess the thing I don't really get is how pickling data in Python can be used to "detect changes in time-varying data", and whether this would supplement using timestamps in the database to determine precedence, or replace that method entirely.
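To make my plan concrete, here's the kind of last-write-wins rule I have in mind when I say "timestamps determine precedence" (the field names here are made up for illustration):

```python
# Illustrative last-write-wins merge: when syncing, the copy of a row
# with the newer `updated_at` timestamp wins. Field names are hypothetical.
from datetime import datetime, timezone

def newer(local_row, master_row):
    """Return whichever copy of the row carries the later timestamp."""
    if local_row["updated_at"] > master_row["updated_at"]:
        return local_row
    return master_row

local = {"id": 1, "name": "Alice",
         "updated_at": datetime(2012, 3, 21, 10, 0, tzinfo=timezone.utc)}
master = {"id": 1, "name": "Alicia",
          "updated_at": datetime(2012, 3, 20, 9, 0, tzinfo=timezone.utc)}

print(newer(local, master)["name"])  # Alice (the local edit is more recent)
```

This is the simplest possible policy; as the comments below point out, ties and concurrent edits make it fragile on its own.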

Anyways, high level explanations or code examples are both welcome. I'm just trying to figure this out.

Thanks

fromabove
  • A few tangential notes: 1) Be aware that the pickle module makes no security guarantees whatsoever; if untrusted sources will be creating data (directly or indirectly), you will want to use something like JSON. 2) When using timestamps, it is not unlikely (in fact sometimes very likely, due to batching) that many entries will have the exact same timestamp (down to the millisecond, maybe even microsecond). Your code should not fail in this case; you may require a vector clock to replace indices, or to modify your semantics. – ninjagecko Mar 22 '12 at 03:04
  • Note that, in general, serialising objects for a database is a bad idea - it goes against the principles of normalisation for a database, and restricts how you access the data, and what you can use to access the data. You are almost always better off storing it in a database properly. – Gareth Latty Mar 22 '12 at 03:06

1 Answer


how pickling data in python can be used to "detect changes in time-varying data."

Bundling data in an opaque format tells you absolutely nothing about time-varying data, except that it might have possibly changed (but you'd need to check that manually by unwrapping it). What the article is actually saying is...
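To illustrate that "check manually by unwrapping it" point: the most you can do with a pickle is compare snapshots byte-for-byte, which tells you *that* something changed, not *what* changed. A minimal sketch (and note the comparison can give false positives if the serialized form isn't canonical, e.g. for sets):

```python
import pickle

def snapshot(obj):
    """Serialize an object to bytes so it can be compared later."""
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

def has_changed(obj, prior_snapshot):
    """Compare serialized forms byte-for-byte.
    This only reports WHETHER something changed, not WHAT changed."""
    return snapshot(obj) != prior_snapshot

record = {"id": 7, "name": "Alice", "balance": 100}
before = snapshot(record)

record["balance"] = 120
print(has_changed(record, before))  # True -- but we don't learn which field
```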

To quote the actual relevant section (link to article at this moment in time):

Since both serializing and deserializing can be driven from common code, (for example, the Serialize function in Microsoft Foundation Classes) it is possible for the common code to do both at the same time, and thus 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy, since differences can be detected "on the fly". This is a way to understand the technique called differential execution[a link which does not exist]. It is useful in the programming of user interfaces whose contents are time-varying — graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.

The term "differential execution" seems to be a neologism coined by this person, who described it in another StackOverflow answer: How does differential execution work?. Reading over that answer, I think I understand what he's trying to say. He seems to be using "differential execution" as an MVC-style concept, in the context where you have lots of view widgets (think a webpage) and you want to allow incremental changes to update just those elements, without forcing a global redraw of the screen. I would not call this "serialization" in the classic sense of the word (not by any stretch, in my humble opinion), but rather "keeping track of the past" or something like that. Because this basically has nothing to do with serialization, the rest of this answer (my interpretation of what he is describing) is probably not worth your time unless you are interested in the topic.


In general, avoiding a global redraw is impossible. Global redraws must sometimes happen: for example in HTML, if you increase the size of an element, you need to reflow lower elements, triggering a repaint. In 3D, you need to redraw everything behind what you update. However, if you follow this technique, you can reduce (though not eliminate) the number of redraws. He claims this technique will avoid the use of most events, avoid OOP, and use only imperative procedures and macros. My interpretation goes as follows:

  • Your drawing functions must know, somehow, how to "erase" themselves and anything they do which may affect the display of unrelated functions.
  • Write a side-effect-free paintEverything() script that imperatively displays everything (e.g. using functions like paintButton() and paintLabel()), using nothing but IF macros/functions. The IF macro works just like an if-statement, except...
  • Whenever you encounter an IF branch, keep track of both which IF statement this was, and the branch you took. "Which IF statement this was" is sort of a vague concept. For example you might decide to implement a FOR loop by combining IFs with recursion, in which case I think you'd need to keep track of the IF statement as a tree (whose nodes are either function calls or IF statements). You ensure the structure of that tree corresponds to the precedence rule "child layout choices depend on this layout choice".
  • Every time a user input event happens, rerun your paintEverything() script. However because we have kept track of which part of the code depends on which other parts, we can automatically skip anything which did not depend on what was updated. For example if paintLabel() did not depend on the state of the button, we can avoid rerunning that part of the paintEverything() script.

The "serialization" (not really serialization, more like naturally-serialized data structure) comes from the execution history of the if-branches. Except, serialization here is not necessary at all; all you needed was to keep track of which part of the display code depends on which others. It just so happens that if you use this technique with serially-executed "smart-if"-statements, it makes sense to use a lazily-evaluated diff of execution history to determine what you need to update.

However this technique does have useful takeaways. I'd say the main takeaway is: it is also a reasonable thing to keep track of dependencies not just in an OOP-style (e.g. not just widget A depends on widget B), but dependencies of the basic combinators in whatever DSL you are programming in. Also dependencies can be inferred from the structure of your program (e.g. like HTML does).

ninjagecko
  • thanks for the answer. DE is an interesting concept, and even though your answer focuses on ui programming, I think the main idea can still apply to my problem of syncing databases. i.e. I need to avoid "redrawing", or re-updating entire records in the master db when changes are made to the local (slave) dbs, and only update the relevant fields in each record. the problem is, I still don't know how to determine which changes are the newest changes. I can serialize and compare the data, as mentioned above, but this will only tell me that the data has been changed ... – fromabove Mar 22 '12 at 14:28
  • I can also use vector clocks, as you mentioned in your comment to the original question, but if there are two offline dbs updating the same record, then wouldn't they need to share the vector clock? Also, your answer is very thorough, but could you please show me a simple code example of using a vector clock? my understanding is it's almost like a global counter, (i=0), and each process or update will increment this counter by 1 (i++). Then, the newest changes to a record will correspond to the highest value for the record's vector clock. Is that essentially it? – fromabove Mar 22 '12 at 14:31
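Since the comment above asks for one: a minimal vector-clock sketch, in Python (illustrative, not from the thread). Note it is *not* a single global counter — each replica increments only its own entry, and comparing two clocks can reveal that neither update "happened before" the other, i.e. a genuine conflict that timestamps alone would silently resolve one way or the other.

```python
# Minimal vector clock: a clock is a dict {replica_name: counter}.

def increment(clock, node):
    """Tick this replica's own counter on a local update."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    """Element-wise max, applied when one replica receives another's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in keys)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

laptop = increment({}, "laptop")    # {'laptop': 1}
master = increment({}, "master")    # {'master': 1}
print(compare(laptop, master))      # 'concurrent' -> needs conflict resolution
```

So yes, the replicas do need to exchange (merge) clocks when they sync; the clock doesn't tell you *which* concurrent update should win, only that a conflict exists and some policy must decide.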