18

I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt with coding a tangible software from ground up.

I want to implement a database engine, sort of a NoSQL database. I've put together a small document, sort of a specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords. I don't know where to start.

Can someone help me find out how to gather the knowledge I need for this kind of work and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated/erroneous content or start from a wrong point, because implementing a complete database engine is (seeming to be) a truly complicated task.

I wan't to express that I'd prefer theses and whitepapers and (e)books to codes of other projects, because I've asked a question of kind in which people usually get answered in the form of "read project - x' source code". I'm not at the level of comfortably reading and understanding source code.

  • 2
    If I were you, I'd *definitely* be starting at a much smaller scale for your first real programs. Learning syntax has almost nothing to do with the practical skills of thinking logically and procedurally like a computer, and how to break down large chunks into smaller actionable tasks. – mellamokb Nov 01 '12 at 13:20
  • 3
    Don't let big code bases intimidate you! They can be hard to figure out even for experienced developers, but you can learn a great deal from studying other people's code. You might spend some time figuring out what all the different modules are, or you might pick one interesting module to study. Either way, you're bound to learn a lot. – Martin Ellis Nov 01 '12 at 13:22
  • @mellamokb That is what I actually need to learn, but I'm self-teaching myself things, so I'm being afraid of learning untrue information and complicationg myself. –  Nov 01 '12 at 13:26

2 Answers2

20

First, you may have a look that the answers for How to write a simple database engine. While it focus on a SQL engine, there is still a lot of good material in the answers.

Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.

Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.

Community
  • 1
  • 1
Laurent Parenteau
  • 2,516
  • 20
  • 31
  • Thank you for answer. I've checked sqlite, they're releasing sqlite source in a so called *amalgamated* state, in which whole c code for the software is in a single ~5mb file, which is impossible to understand anything from. I't is heavily commented code though. BtreeDB looks like a good resource to me, I'll try to go over C++ syntax a little bit to understand easily. –  Nov 01 '12 at 15:40
  • @g_kaya Do not forget to look at all the answers for 'How to write a database engine'. Others there suggest books, projects, etc. – Laurent Parenteau Nov 01 '12 at 17:52
19

Because the accepted answer only offers (good) links to other resources, I'd thought I share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code. It's pretty small. You should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this and since writing it I learned a lot more about it and see I have been doing some things wrong. It can help you get started though.

The basics: BTree

I started out with adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and related data (if any) in a node, then all items with key < K in a node in the left subtree and all items with key > K in a right subtree. You can use an array to store the data items if you want to support non unique keys.

This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.

A schema

As a next step I wrote code that lets the client code define a schema. Methods like createTable() etc. Schemas are typically associated with SQL, but even no-SQL sort-of has a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as primary key and which fields will be searched on frequently and need an index.

Creating a data structure to store a table

I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).

Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store the (reference to the) actual object is because it makes set operations a little bit simpler (see next step)

Querying

Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.

In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:

var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
})

To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.

Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.

Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)

Stijn de Witt
  • 40,192
  • 13
  • 79
  • 80
  • Hi Stijn, you saved the data in a file in AVL tree structure? How do you retrieve the data? Do you parse the whole file each time? Could you show a pseudo code please? – Miko Chu Sep 13 '22 at 11:29