
Due to several edits, this question might have become a bit incoherent. I apologize.

I'm currently writing a Python server. It will never see more than 4 active users, but I'm a computer science student, so I'm planning for it anyway.

Currently, I'm about to implement a function to save a backup of the current state of all relevant variables into CSV files. Of those I currently have 10, and they will never be really big, but... well, computer science student and so on.

So, I am currently thinking about two things:

  1. When to run a backup?
  2. What kind of backup?

When to run:

I can either run a backup every time a variable changes, which has the advantage that the backup always holds the current state, or run one on a timer, say once every minute. The timer avoids rewriting the file hundreds of times per minute if the server gets busy, but it will cause a lot of useless rewrites of unchanged data unless I detect which variables have changed since the last backup.
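The timed option with change detection could be sketched roughly as below. This is only an illustration under assumptions: the class name `TrackedState`, the one-CSV-file-per-variable layout, and the idea that each variable is a list of rows are all made up for the example. A timer (or the server's main loop) would call `backup_changed()` once per interval.

```python
import csv
import os
import tempfile


class TrackedState:
    """Hypothetical holder for named variables that tracks what changed."""

    def __init__(self, backup_dir):
        self.backup_dir = backup_dir
        self._values = {}    # variable name -> list of rows
        self._dirty = set()  # names changed since the last backup

    def set(self, name, rows):
        self._values[name] = rows
        self._dirty.add(name)

    def backup_changed(self):
        """Rewrite only the CSV files whose variable changed; return their names."""
        written = sorted(self._dirty)
        for name in written:
            path = os.path.join(self.backup_dir, name + ".csv")
            # Write to a temporary file first, then swap it in, so a crash
            # mid-write does not leave a half-written backup behind.
            fd, tmp_path = tempfile.mkstemp(dir=self.backup_dir, suffix=".tmp")
            with os.fdopen(fd, "w", newline="") as handle:
                csv.writer(handle).writerows(self._values[name])
            os.replace(tmp_path, path)
        self._dirty.clear()
        return written
```

With this shape, a quiet minute costs nothing because the dirty set is empty, and a busy minute rewrites each changed file only once.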

Directly related to that is the question of what kind of backup I should do.

I can do a full backup of all variables (pointless if I back up on every change, but possibly good if I back up every X minutes). I can do a full backup of a single variable (better if I back up on every change, but it would need either multiple backup functions or some smart detection of which variable is being backed up). Or I can try some sort of delta backup on the files (which would probably mean reading the current file and rewriting it with the changes, so it's probably pretty stupid, unless there is a trick for this in Python that I don't know about).

I cannot use shelves because I want the data to be portable between different programming languages (Java, for example, probably cannot open Python shelves), and I cannot use MySQL for several reasons, mainly that the machine that will run the server has no MySQL support and I don't want to use an external MySQL server, since I want the server to keep running when the internet connection drops.

I am also aware that there are several ways to do this with built-in functions of Python and/or other software (SQLite, for example). I am just a big fan of building this stuff myself, not because I like to reinvent the wheel, but because I like to know how the things I use work. I'm building this server partly just to learn Python, and although knowing how to use SQLite is useful, I also enjoy doing the "dirty work" myself.

In my usage scenario of possibly a few requests per day I am tending towards the "backup on change" idea, but that would quickly fall apart if, for some reason, the server gets really, really busy.

So, my question basically boils down to this: Which backup method would be the most useful in this scenario, and have I possibly missed another backup strategy? How do you decide on which strategy to use in your applications?

Please note that I raise this question mostly out of a general curiosity for backup strategies and the thoughts behind them, and not because of problems in this special case.

malexmave
  • What's wrong with using `shelve`? It seems to cover your use cases. Why not simply use ordinary persistent file storage for current state? – S.Lott Feb 28 '12 at 19:12
  • @S.Lott I wasn't aware that shelve exists, so thanks for that. But I'm looking for the ability to take my files and use them at a different server, written in a different language (java), so shelves might not be the best idea. Should have written that, will edit to reflect this. – malexmave Feb 28 '12 at 19:14
  • Have you thought about using sqlite? It seems attractive for this situation since you have 1)Atomic updates. 2)It's a single file database that can be moved around. 3)Python standard library support (sqlite3) 4)The possibility of accessing data from Java 5)faster than writing to csv. http://www.sqlite.org/ – Wilduck Feb 28 '12 at 19:31
  • An **Edit:** is not really the best way to edit a question. Actually put the information in where it belongs (up front), not at the end. You're reinventing the wheel here, and it's very important to do a lot more reading about persistent storage. There are probably hundreds of packages that offer persistent state through the file system. It's the reason we have a file system. Please edit the title to more accurately describe what you're doing. It's just persistent state. It's not a database backup; that has a very specific connotation, and your question doesn't match the meaning of the phrase. – S.Lott Feb 28 '12 at 19:33
  • @Wilduck I was planning on using sqlite, but I am running the server on a pretty limited system which has no support for mySQL, and I don't want to use an external SQL since it should keep working when the internet is down for me, but I don't have a local machine that is running 24/7 and is able to run SQL. – malexmave Feb 28 '12 at 19:39
  • SQLite has nothing to do with MySQL. Why introduce MySQL? – S.Lott Feb 28 '12 at 19:41
  • @S.Lott Huh... I was thinking about the Python MySQL import. Right, that's not SQLite, that's MySQLdb. I should _really_ try to get some sleep before asking questions here... Sorry. – malexmave Feb 28 '12 at 19:46
  • @S.Lott I have edited the OP to answer your comment about reinventing the wheel. – malexmave Feb 28 '12 at 19:54

2 Answers


Use sqlite. You're asking about building persistent storage using csv files, and about how to update the files as things change. What you're asking for is a lightweight, portable relational (as in, table based) database. Sqlite is perfect for this situation.

Python has had sqlite support in the standard library since version 2.5 with the sqlite3 module. Since a sqlite database is implemented as a single file, it's simple to move them across machines, and Java has a number of different ways to interact with sqlite.
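As a minimal sketch of what that could look like with the standard-library `sqlite3` module: the table name `state`, its two text columns, and the helper names `open_store`, `save_value` and `load_value` are assumptions made up for this example, not anything taken from the question.

```python
import sqlite3


def open_store(path):
    """Open (or create) the single-file database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state (name TEXT PRIMARY KEY, value TEXT)"
    )
    return conn


def save_value(conn, name, value):
    # Using the connection as a context manager wraps the statement in a
    # transaction: it commits on success and rolls back if an error occurs.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO state (name, value) VALUES (?, ?)",
            (name, value),
        )


def load_value(conn, name):
    row = conn.execute(
        "SELECT value FROM state WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

Calling `save_value` on every change is then the "backup on change" strategy, with the database taking care of writing each update atomically.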

I'm all for doing things for the sake of learning, but if you really want to learn about data persistence, I wouldn't get too attached to the idea of a "csv database". I would start by looking at the Wikipedia page for Persistence. What you're thinking about is basically a "System Image" for your data. The Wikipedia article describes some of the same shortcomings of this approach that you've mentioned:

> State changes made to a system after its last image was saved are lost in the case of a system failure or shutdown. Saving an image for every single change would be too time-consuming for most systems

Rather than trying to update your state wholesale at every change, I think you'd be better off looking at some other form of persistence. For example, some sort of journal could work well. This makes it simple to just append any change to the end of a log-file, or some similar construct.
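A rough sketch of such a journal, assuming one JSON object per line so that other languages can read the file too. The record layout (`name`, `value`) and the function names are invented for illustration.

```python
import json
import os


def append_change(journal_path, name, value):
    """Append one change record to the journal as a single JSON line."""
    record = json.dumps({"name": name, "value": value})
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(record + "\n")
        journal.flush()
        os.fsync(journal.fileno())  # ask the OS to push the line to disk


def replay(journal_path):
    """Rebuild the current state by applying every record in order."""
    state = {}
    if not os.path.exists(journal_path):
        return state
    with open(journal_path, encoding="utf-8") as journal:
        for line in journal:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except ValueError:
                # A torn final line (e.g. a crash mid-write) ends the replay.
                break
            state[record["name"]] = record["value"]
    return state
```

Each change costs one small append instead of a full rewrite. The trade-off is that the journal grows until it is compacted into a fresh snapshot.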

However, if you end up with many concurrent users, with processes running on multiple threads, you'll run into concerns about whether your changes are atomic and whether they conflict with one another. While operating systems generally have some ways of dealing with locking files for edits, you're opening up a can of worms trying to learn how that works and interacts with your system. At this point you're back to needing a database.

So sure, play around with a couple different approaches. But as soon as you're looking to just get it working in a clear and consistent manner, go with sqlite.

Wilduck
  • Thanks for the in-depth answer. I will see whether I implement it all myself or just go with SQLite in the end. I am actually curious about implementing the journal-like system; that sounds like a good combination of speed and persistence. So I might implement it for fun and then use SQLite in the end. Got to check if my target machine can use SQLite, but I am pretty sure it can. – malexmave Feb 29 '12 at 09:06

If your data is in CSV files, why not use a revision control system on those files? E.g. git would be pretty fast and give excellent history. The repository would be wholly contained in the directory where the files reside, so it's pretty easy to handle. You could also replicate that repository to other machines or directories easily.

Roland Smith
  • My question was not what to do with the CSV files, but when and how to generate them. I don't really need a full history of all server states. Still, thanks for trying to help. – malexmave Feb 28 '12 at 20:59
  • Ok. Why not keep one set of files of the initial state (only read at startup), and a second set of files that only record changes. You can keep the second set of files open in append mode, and write out every transaction as it occurs. When the program stops, it should write a complete set of initial files, and delete the second set. – Roland Smith Feb 28 '12 at 21:09
  • That might actually work. I would need to add a check at startup for the "changes"-file (in case the server crashed and did not terminate properly), but other than that, it seems like a good plan if I stick to my idea of not wanting to implement everything myself. Thanks – malexmave Feb 29 '12 at 09:01
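
The scheme from these comments (a snapshot read at startup, a changes file kept open in append mode, and a check for a leftover changes file after a crash) might be sketched like this. It is only an illustration: the class name, the two-column `key,value` CSV layout, and the file paths are assumptions, and all values are treated as strings.

```python
import csv
import os


class SnapshotWithChanges:
    """Hypothetical store: one snapshot CSV plus an append-only changes CSV."""

    def __init__(self, snapshot_path, changes_path):
        self.snapshot_path = snapshot_path
        self.changes_path = changes_path
        self.state = {}
        self._load(snapshot_path)
        # A leftover changes file means the last run did not shut down cleanly;
        # replaying it recovers the transactions written before the crash.
        self._load(changes_path)
        self._changes = open(changes_path, "a", newline="")
        self._writer = csv.writer(self._changes)

    def _load(self, path):
        if os.path.exists(path):
            with open(path, newline="") as handle:
                for row in csv.reader(handle):
                    if len(row) == 2:  # skip blank or incomplete rows
                        key, value = row
                        self.state[key] = value

    def set(self, key, value):
        self.state[key] = value
        self._writer.writerow([key, value])
        self._changes.flush()

    def close(self):
        """Clean shutdown: write a fresh snapshot, then drop the changes file."""
        self._changes.close()
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w", newline="") as handle:
            csv.writer(handle).writerows(self.state.items())
        os.replace(tmp_path, self.snapshot_path)
        os.remove(self.changes_path)
```

Because later rows overwrite earlier ones during loading, replaying the changes file on top of the snapshot gives the same state the server had when it stopped.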