1

I am designing a database to hold past and future data of a scientific research project. I tried to simplify the discussion by removing all non-database-related terms :)

Description

Let's say we have a collection of Foo and Bar objects. Each Bar is a child of a Foo, in the sense that it belongs to it. Both Foos and Bars can have an unlimited number of properties (attributes). Different properties can have different data types, which can be numbers, texts, images, files etc. Moreover, each property of each object has an history of values, which must be stored in the database.

I want to be able to add new properties without editing the structure of the database. I am not forced to use any particular database software, but I would like to have a Python interface to it, since everybody in the group can use Python. Finally, it would useful if the database was file-based.

Example

This is a diagram to better represent the description above. Hope it helps.

example of data and data structure

My attempt

I decided to try with SQLite, to have file-based database that was easily accessible with Python. This is the database schema I adopted:

database schema

I am using Python's sqlite3 to connect to it and pickle to put the different type of data inside the BLOB value column. The disadvantage with of this approach is of course that SQLite has no idea of what's inside value, therefore I need to retrieve everything every time to run searches, and I can't take advantage of SQL queries.

The question, finally

Is there a better (e.g. easier to code || faster || already available || ...) solution to this problem? Again, I am not bound to any particular kind of database. I am a physicist with a little background in computer science, so any help is appreciated.

  • if you don't have experience in computer science, here is the easier way to handle that, first create a spreadsheet with example data in non normalize way with duplication, than try to normalize to remove duplication. Things like this can be resolved by just adding a type field say foo/bar and a parent field but I'm confused by your design. If you provide sample data in one sheet, I'll be able to normalize it for you. – vimdude Jan 27 '18 at 18:27

1 Answers1

2

RDF stores does exactly what you need natively. So your basic "schema" with data in Turtle RDF format will look as follows:

# Prefixes
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://stackoverflow.com/q/48479002/database-design-for-non-homogeneous-data-with-history-of-values/>

:Foo rdf:type rdfs:Class .
:Bar rdfs:subClassOf :Foo .

:Bar_97 rdf:type :Foo .
:Bar_97 :prop1 "20141201" .
:Bar_97 :prop1 "20141202" .
:Bar_97 :prop1 "20141209" .
:Bar_97 :prop173 "Some interesting property" .

As you can see you can assign multiple values to the same property, i.e. the prop1 property for Bar_97 has the values 20141201, 20141202 and 20141209.

For a list of RDF stores see RDF store implementations. To access an RDF store from Python you can use RDFlib. To ensure that RDFlib returns all children of Foo (including Bar), see RDFlib reasoning problem.

As for the RDF store being file based, the stores are often stored as multiple files for ensuring efficiency of queries. However, you should be able to export the data into Turtle (or some other human readable) format.

A last observation is that RDF allows you to infer information that has not been explicitly stated in your data from what has been explicitly stated. I.e. from x rd:type :Foo it can infer that x is a Bar as well. However, RDFlib does not support reasoning, hence the reason for the RDFlib reasoning problem workaround, which works, but the moment you may need additional reasoning capability it may not be possible to address using RDFlib. I am not aware of Python implementations that implements reasoning. For this reason you may want to consider Java implementations like Jena or RDF4J.

Henriette Harmse
  • 4,167
  • 1
  • 13
  • 22