
I'm fairly new to the realm of relational database design, and am trying to store a 9x9 covariance matrix in a table. The rows and columns are x, y, z terms of position, velocity, and acceleration. Like so:

        PosX PosY PosZ . . . AccZ
      -------------------------
PosX  |  XX   XY   XZ  . . .
PosY  |  YX   YY   YZ  . . . 
PosZ  |  ZX   ZY   ZZ  . . . 
.     |  .    .    .
.     |  .    .    .
.     |  .    .    .
AccZ  |

So for instance, the upper-left most element is PosXPosX (shortened to XX above), to the right is PosXPosY, and so on. The matrix is symmetric along the diagonal (i.e. PosXPosY == PosYPosX). It's also possible that I will want to store a 6x6 matrix that includes only position and velocity in this same table.

From my research, I've found a normalized table design: a table with fields for row number, column number, and value (How to represent a 2-D data matrix in a database). I can see that the benefit of this is flexibility, since the number of rows and columns can be variable. Is this the best way to proceed, even though I have a set number of rows and columns (9x9 and/or 6x6)? I could also envision creating a table with a field for each unique row/col combination (PosXPosX, PosXPosY, . . . etc.). That seems more intuitive to me, but like I said, I'm new at this.
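
For concreteness, here is roughly what I picture each option looking like in SQLite (which is what I'm using). The table and column names are just placeholders, and the second option would really have a column for all 45 unique terms:

    import sqlite3

    con = sqlite3.connect("test_data.db")

    # Option 1: normalized -- one record per matrix element
    con.execute("""
        CREATE TABLE covariance_element (
            covariance_id INTEGER,   -- which matrix the element belongs to
            row_num       INTEGER,
            col_num       INTEGER,
            value         REAL
        )""")

    # Option 2: one record per matrix, one column per unique term
    con.execute("""
        CREATE TABLE covariance_wide (
            covariance_id INTEGER PRIMARY KEY,
            PosXPosX REAL,
            PosXPosY REAL,
            PosXPosZ REAL
            -- ... and so on for the remaining terms
        )""")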

My question is: How would you suggest representing my data in a relational database? I've outlined two possible methods but I don't know if either is the best way. "Best" in my case will mean efficiently stored and retrieved. What I'm creating is really a data repository, so the data in the database will not be changing once it is added, only read into numpy arrays or similar.

Some more background:
I'm analyzing test data. I have multiple test runs with different configurations, each having multiple data points that include a lot of different kinds of data. One of the data points I want to store and analyze is covariance. The amount of data I'm dealing with is quite staggering, so I'm hoping that using a database will help me to keep things organized and accessible. The goal is to store all this data, and then write data analysis and visualization tools that will draw from the data. In the case of covariance, I'm calculating things like Mahalanobis Distance, trace, and time propagated eigenvalues. I have many of these tools already, but they currently pull from a lot of different log files and are generally a mess.

IanVS
  • Databases are awful at matrix calculations. I'd suggest not to store this in a database at all. – Andomar Jun 04 '12 at 17:59
  • What do you want to do with it? – Gordon Linoff Jun 04 '12 at 18:00
  • @Andomar: Thanks, I'm not really trying to do calculations in the database, I just want to store the data so I can pull it out and do calculations in a separate tool (Python, in my case). – IanVS Jun 04 '12 at 18:08
  • Will you assume that 9x9 is the max size you will need? One thing you could also consider is using a csv file, though if there is a lot of read/write contention this is probably more work than it's worth. – John Kane Jun 04 '12 at 18:23
  • @JohnKane: Yes, that is the max I will ever need. (dangerous words, I know) – IanVS Jun 04 '12 at 18:25
  • Is it also safe to assume that once the data is written you will not be modifying it again while you are doing your analysis? – John Kane Jun 04 '12 at 18:29
  • Exactly. Once the data is there, it's set. It won't have to change. I just want to be able to read it back out (into a numpy array) and do stuff with it. – IanVS Jun 04 '12 at 18:35

2 Answers


Databases are great at storing staggering amounts of data. Seems like you'll want to use them to quickly, clearly, and readily store and retrieve your information, if not to perform the actual calculations. If so, then you'll want to design your storage for efficient retrieval.

Fields (table columns) for the matrix row and column would seem to be a must. You would need to be sure never to add values greater than your matrix size (no row/col over 6 or 9). There are tricks you can do within the RDBMS to ensure this never happens, but they can get kind of kludgy.

You will also find it essential to have an additional field that uniquely identifies which set of data is being stored. Build an index on the table for that column, and retrieving one set of 36 or 81 values from millions will be extremely fast.

Your second idea (one row per observation) could also work. The key thing to consider is: how will the data be used? What will be most efficient, or best satisfy your goals, in the long run?
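
For illustration only (SQLite syntax, and the table and column names are made up), the in-database tricks mentioned above are typically CHECK constraints, and the identifying field would lead the composite key/index:

    import sqlite3

    con = sqlite3.connect("test_data.db")
    con.execute("""
        CREATE TABLE cov_cell (
            set_id  INTEGER NOT NULL,                          -- which matrix
            row_num INTEGER NOT NULL CHECK (row_num BETWEEN 1 AND 9),
            col_num INTEGER NOT NULL CHECK (col_num BETWEEN 1 AND 9),
            value   REAL    NOT NULL,
            PRIMARY KEY (set_id, row_num, col_num)             -- set_id first
        )""")

    # Pulling the 36 or 81 values for one set is then a single index range scan.
    cells = con.execute(
        "SELECT row_num, col_num, value FROM cov_cell WHERE set_id = ?",
        (1,)).fetchall()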

Philip Kelley
  • Thanks. I see your point about another identifying field. I'm guessing after I have that, I can create a primary key out of the identifier, row, and column fields. Ultimately I want to be able to quickly and easily write/read from the database to a numpy array. – IanVS Jun 04 '12 at 18:34
  • That would work as a primary key. Note, however, that an index containing a bajillion rows whose values only range from 1 to 6 (or 9) won't help at all with performance. (Make the identifying field the first in the index, and it'll be fine.) – Philip Kelley Jun 04 '12 at 19:45

As long as your data set is small, and you can assume that the values will not be modified by an external application once written, it might be worth considering a csv file and just reading the data into whatever data structure is most useful while you are doing your analysis. This also makes the data slightly easier to inspect, because all you need is a text editor, or a spreadsheet if you want to view it that way.
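
For instance (the file name here is invented), numpy will read such a file straight back into an array:

    import numpy as np

    # one csv file per matrix: nine lines of nine comma-separated values each
    cov = np.loadtxt("run01_track03_cov.csv", delimiter=",")   # -> 9x9 array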

Based on what you have said, it seems that the cleanest thing is to have one record per cell, because this will give you the most flexibility in the future. If you are interested in reading about database design, this is a pretty good starting place.

One possible design could be:

    table matrix(record_id, parent_id, matrix_id, x, y, value)

where record_id uniquely identifies a record, parent_id is a reference to the owning entity for this matrix, matrix_id uniquely identifies which matrix a record belongs to, x and y are the coordinates of the cell, and value is the cell value.

Then a query could look something like:

    select *
    from matrix
    where matrix_id = ?
    order by x asc, y asc
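
And since the goal is to get this back into numpy, here is a rough sketch of the round trip with sqlite3 (untested; it assumes record_id is an INTEGER PRIMARY KEY so SQLite fills it in, and that x and y are 0-based):

    import math
    import sqlite3

    import numpy as np

    con = sqlite3.connect("test_data.db")

    def store_matrix(con, parent_id, matrix_id, cov):
        """Store one covariance matrix, one record per cell."""
        cells = [(parent_id, matrix_id, x, y, float(cov[x, y]))
                 for x in range(cov.shape[0])
                 for y in range(cov.shape[1])]
        con.executemany(
            "insert into matrix (parent_id, matrix_id, x, y, value) "
            "values (?, ?, ?, ?, ?)", cells)
        con.commit()

    def load_matrix(con, matrix_id):
        """Read one matrix back into a square numpy array."""
        cells = con.execute(
            "select x, y, value from matrix where matrix_id = ? "
            "order by x asc, y asc", (matrix_id,)).fetchall()
        n = math.isqrt(len(cells))        # 81 -> 9, 36 -> 6
        cov = np.empty((n, n))
        for x, y, value in cells:
            cov[x, y] = value
        return cov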
John Kane
  • Thanks, this is actually what I have now. My reason for attempting to use a database is that my overall dataset is not small or simple at all. For example, on any given day I might perform 5 runs, each with 10 "tracks", each of those with 1,000 points in time, and maybe 5 covariances for each point in time (along with a lot of other data). I then might want to compare any of those tracks to another track from either that run or a different run. From what I knew of relational databases, this seemed like a potentially good application. – IanVS Jun 04 '12 at 18:44
  • Yeah, I was hoping the scale was smaller. A relational DB probably makes the most sense. – John Kane Jun 04 '12 at 18:48
  • csv is certainly the simpler way, which is why I started with that. Now I have an opportunity to improve the way I'm doing things and learn about databases in the process. Win-win! – IanVS Jun 04 '12 at 18:52
  • What other data are you storing in the db? Have you thought about how you are going to tie a matrix to its owning entity? – John Kane Jun 04 '12 at 18:53
  • I'm not sure which type of database you are using, but some support an array type, which you could use to back your matrix. – John Kane Jun 04 '12 at 18:56
  • I'm using sqlite, which I don't believe has an array type (that sure would help). – IanVS Jun 04 '12 at 19:00
  • The way I'll tie the covariance back to its owning record will depend on how the covariance table is created. If I go with the one row per covariance (with fields for each of the terms), then it's a simple one to many between the parent and covariances. If I go the other route, I'd create a primary key composed of a unique covariance ID (not sure how I'd create this, actually), row field, and column field. Now that I'm thinking it through, the first way sure seems easier. – IanVS Jun 04 '12 at 19:03
  • It sounds like this could be a relatively clean schema: matrix(record_id, parent_id, x, y, data). It helps with retrieval to index each record uniquely, and when you write your query you could search on the parent_id, so it should be relatively quick. Will these be just 2-dimensional matrices? – John Kane Jun 04 '12 at 19:06
  • Yes it will be 2 dimensions only, but there may be more than one covariance per parent (i.e. a one-to-many). I believe if that's the case I won't be able to separate out individual matrices if I do as you describe, correct? – IanVS Jun 04 '12 at 19:14
  • You could add another field for covariance_id, plus a separate table holding a name or some other useful information related to it, and use that id to uniquely identify each matrix. Though you could also just get the nextVal from a sequencer (as opposed to creating another table) when needed to uniquely identify a matrix. – John Kane Jun 04 '12 at 19:20
  • So do you see value in assigning each cell of the covariance to its own row in the database (as you're describing here) as opposed to simply creating a table that has 45 fields (one for each unique value in the matrix)? If so, can you explain the benefits in an answer so I can mark the question as answered? Thanks for your help! – IanVS Jun 04 '12 at 19:24
  • The decision of one row of 45 cols vs. 45 rows of 1 col is a hard one to make--it truly depends on the nature of the data and how it will be used. It ties in with normalization and denormalization issues, and I, for one, don't know enough about covariances to give a good answer. Design for performance and flexibility, do some testing and proofing, and don't be afraid to make radical changes if it looks like you're going down the wrong path. – Philip Kelley Jun 04 '12 at 19:43
  • @Philip Kelley Yes, 1 record with 45 cols could make sense if the size of your data will never change, which seems unlikely. You have, and will maintain, much more freedom by using one record per cell. The lookup times should not be that much different since they are all indexed. Also, with one record per cell you gain the ability to easily query for specific records, outlying conditions, boundary testing, etc., along with much more freedom to easily change your backing data structure or how you order your queried results. – John Kane Jun 04 '12 at 19:57
  • It seems that using one record per cell offers much more freedom with a very minimal performance hit, whereas having to change one table with 9x9 (or 6x6) columns would not be very clean. You would also have to keep track of the size of the matrix you are dealing with so you would know which columns to pull the data out of (depending on the backing data structure), or create a table for every matrix size. – John Kane Jun 04 '12 at 20:04
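
As a small, database-free illustration of the symmetry point in this discussion (pure numpy, example values invented): the 45 unique terms of a symmetric 9x9 are just its upper triangle, and the full matrix can be rebuilt from them whichever storage layout they came out of:

    import numpy as np

    cov = np.arange(81, dtype=float).reshape(9, 9)
    cov = (cov + cov.T) / 2.0             # symmetric example matrix

    iu = np.triu_indices(9)               # row/col indices of the upper triangle
    unique_terms = cov[iu]                # the 45 values you would actually store

    # rebuild the full 9x9 from the 45 stored values
    rebuilt = np.zeros((9, 9))
    rebuilt[iu] = unique_terms
    rebuilt = rebuilt + rebuilt.T - np.diag(np.diag(rebuilt))
    assert np.allclose(rebuilt, cov)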