
I'm working on constructing a database (SQLite) to store information about each run of a Mathematica script I've written. The script takes several input parameters, so my DB has a table with a column for each parameter (among other columns).

Some of the input parameters are lists of numbers. My first thought for storing these is to use a junction table as described in the accepted answer to this question. But I typically use the same list for several different runs. How can I look up whether any given list is already in the database, so I can reuse its ID rather than storing it again?

Constraints as mentioned in comments:

  • There is no explicit upper bound on the length of a list but in practice it ranges from 1 to about 50.
  • The number of distinct lists will be small, on the order of 10.
  • I actually have 3 list parameters. For two of them, the values in the list are non-negative, double precision floating point numbers; for the third, the values are pairs of such numbers.
  • There are no duplicate entries. (These are more precisely sets, so no duplicates and order is irrelevant)
  • I can easily arrange for the list elements to be in sorted order.

For example: suppose my table is set up like this

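CREATE TABLE jobs (id INTEGER PRIMARY KEY, param1 REAL, param2_id INTEGER);
-- One row per list element, so the key has to be composite: a list spans several rows
CREATE TABLE param2 (param2_id INTEGER, value REAL, PRIMARY KEY (param2_id, value));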

When I run the script, it sets the parameters and then calls a function to run the calculation, like so:

param1 = 4;
param2 = {.1, .3, .5};
runTheCalculation[param1, param2]

Assuming this is the very first run of the script, it will insert the following contents into the DB:

jobs:   id      param1     param2_id
         1       4.0        1

param2: param2_id   value
         1           0.1
         1           0.3
         1           0.5

So far, so good. Now let's say I run the script again with one different parameter,

param1 = 2;
param2 = {.1, .3, .5};
runTheCalculation[param1, param2]

In a naive implementation, this will result in the database containing this:

jobs:   id      param1     param2_id
         1       4.0        1
         2       2.0        2

param2: param2_id   value
         1           0.1
         1           0.3
         1           0.5
         2           0.1
         2           0.3
         2           0.5

But I would like it to be able to look up the fact that the list {.1, .3, .5} is already in the database, so that after the second run the DB contains this instead:

jobs:   id      param1     param2_id
         1       4.0        1
         2       2.0        1

param2: param2_id   value
         1           0.1
         1           0.3
         1           0.5

What sort of a query can I use to find that the list {.1, .3, .5} already exists in the table param2?

I'm not opposed to creating additional tables if necessary. Or if there is some model other than using a junction table that makes more sense, that's fine too.

David Z
  • What's the maximum length of a list? And what is the possible range of values for each entry in the list? (e.g., is each entry a rounded decimal between 0.000 and 1.000?) Finally, are lists always pre-sorted (and can they have duplicate entries)? The answers to these questions have implications for the possible solutions you can employ. – Julius Musseau Dec 16 '11 at 22:42
  • I've edited that information in, thanks for the feedback Julius. – David Z Dec 16 '11 at 23:04

3 Answers


If the lists are short and there are relatively few of them, you can simply read back each list stored in TBL_Lists and check whether it matches yours. This is inefficient, because every lookup enumerates all stored lists to compare them against your one candidate list.
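The brute-force comparison could be sketched roughly as follows, in Python with the standard sqlite3 module purely for illustration. It assumes the question's param2 table (one row per element, keyed on the pair of columns) plays the role of TBL_Lists; the function name `find_list_id_by_scan` is made up for this example.

```python
import sqlite3

def find_list_id_by_scan(conn, values):
    """Return the param2_id whose stored values equal `values` as a set, else None."""
    target = frozenset(float(v) for v in values)
    stored = {}
    # Read every stored element once and group the elements by list id.
    for list_id, value in conn.execute("SELECT param2_id, value FROM param2"):
        stored.setdefault(list_id, set()).add(value)
    # Compare each stored list against the candidate.
    for list_id, members in stored.items():
        if members == target:
            return list_id
    return None

# Small in-memory database for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE param2 (param2_id INTEGER, value REAL, "
    "PRIMARY KEY (param2_id, value))"
)
conn.executemany(
    "INSERT INTO param2 VALUES (?, ?)",
    [(1, 0.1), (1, 0.3), (1, 0.5), (2, 0.1), (2, 0.3)],
)
print(find_list_id_by_scan(conn, [0.5, 0.1, 0.3]))
```

With roughly 10 lists of at most 50 elements, as described in the question, this full scan is probably cheap enough in practice; the cost only matters as the number of stored lists grows.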

Another way, and the better way in my opinion, is to hash the list and store its hash in a TBL_List_Hashes table.

Hashing the list will require enumerating it one time.

An example hashing algorithm might be to build a string of all the sorted numerical values, uniformly padded, and then run any hashing method on the concatenated string.
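A minimal sketch of that hashing step, written in Python for illustration (the question's script is Mathematica, so this would need porting). The function names, the `;` separator, the `.16e` number format, and the choice of SHA-256 are all assumptions made for this example rather than anything prescribed above.

```python
import hashlib

def canonical_string(values):
    """Sort, de-duplicate, and format every number the same way."""
    unique_sorted = sorted(set(float(v) for v in values))
    # ".16e" prints 17 significant digits, enough to round-trip a double.
    return ";".join(format(v, ".16e") for v in unique_sorted)

def list_hash(values):
    """Hash the canonical string; any stable hash function would do."""
    return hashlib.sha256(canonical_string(values).encode("ascii")).hexdigest()

# The same set in a different order yields the same hash.
print(list_hash([0.5, 0.1, 0.3]) == list_hash([0.1, 0.3, 0.5]))
```

Formatting with a fixed precision matters here: two runs that print the same double differently (say `0.1` versus `0.10`) would otherwise produce different hashes for the same list.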

It should be relatively easy to obtain a hash of a given list and then retrieve the matching hash from the DB. Even with a relatively simple hash algorithm with collisions you will be able to significantly reduce the number of lists you need to validate in order to make the comparison.

The trade-off is that every collision costs an extra query and enumeration to rule out the false match, so a hash with fewer collisions means fewer wasted comparisons.
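The lookup-then-verify flow might look something like the sketch below (Python with sqlite3, for illustration only). The table names `list_hashes` and `list_values`, the index name, and the function names are hypothetical stand-ins for TBL_List_Hashes and TBL_Lists. The `hash_fn` parameter exists only so that a deliberately weak hash can be swapped in to exercise the collision path.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE list_hashes (list_id INTEGER PRIMARY KEY, hash TEXT NOT NULL);
CREATE INDEX list_hashes_by_hash ON list_hashes (hash);
CREATE TABLE list_values (list_id INTEGER NOT NULL, value REAL NOT NULL,
                          PRIMARY KEY (list_id, value));
"""

def default_hash(sorted_values):
    text = ";".join(format(v, ".16e") for v in sorted_values)
    return hashlib.sha256(text.encode("ascii")).hexdigest()

def get_or_create_list_id(conn, values, hash_fn=default_hash):
    """Return the id of an existing equal list, inserting the list if absent."""
    values = sorted(set(float(v) for v in values))
    digest = hash_fn(values)
    # Step 1: the index narrows the search to lists with the same hash.
    candidates = conn.execute(
        "SELECT list_id FROM list_hashes WHERE hash = ?", (digest,)
    ).fetchall()
    # Step 2: verify each candidate element by element to rule out collisions.
    for (list_id,) in candidates:
        stored = [row[0] for row in conn.execute(
            "SELECT value FROM list_values WHERE list_id = ? ORDER BY value",
            (list_id,),
        )]
        if stored == values:
            return list_id
    # Step 3: no match, so store the new list and its hash.
    cur = conn.execute("INSERT INTO list_hashes (hash) VALUES (?)", (digest,))
    list_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO list_values VALUES (?, ?)", [(list_id, v) for v in values]
    )
    return list_id

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = get_or_create_list_id(conn, [0.1, 0.3, 0.5])
second = get_or_create_list_id(conn, [0.5, 0.3, 0.1])
print(first == second)
```

With a strong hash the verification loop almost never sees more than one candidate, but keeping it means a collision can only cost time, never return the wrong list.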

EDIT:
Here is a relevant answer for .NET:
.net 3.5 List<T> Equality and GetHashCode

EDIT2:
And if your matching is order-agnostic, simply standardize the list order before hashing:
GetHashCode for a Class with a List Object

Matthew

You ask: How can I look up whether any given list is already in the database?

The normal way is to use an index, and indexes are always row-oriented. So standard database design suggests you somehow need to get the whole list (normalized) into a row.

Since you're on SQLite, you don't have too many options:

http://www.sqlite.org/datatype3.html

I recommend TEXT! You can index BLOB as well, and BLOB will save some space, but probably TEXT will work just fine, and TEXT is usually a lot more convenient to debug and work with. Try to invent some kind of canonical String format for your lists that you can parse/generate, and always INSERT/SELECT that from the database in a consistent way (e.g., consistent rounding, pre-sorted, duplicates removed, trailing and leading zeroes always consistent), and you should be fine.
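As a rough sketch of the canonical-TEXT idea (Python with sqlite3, for illustration): the table `param2_lists`, its columns, and the helper names are assumptions for this example. The format used here is a comma-separated list of Python `repr()` strings, which is just one possible canonical form; `repr()` of a float is used because it round-trips exactly.

```python
import sqlite3

def canonical_text(values):
    """Sorted, de-duplicated, comma-separated; repr() round-trips a float."""
    return ",".join(repr(v) for v in sorted(set(float(x) for x in values)))

def parse_canonical(text):
    """Inverse of canonical_text."""
    return [float(s) for s in text.split(",")] if text else []

def get_or_create_id(conn, values):
    text = canonical_text(values)
    # The UNIQUE constraint creates the index that makes this lookup fast;
    # OR IGNORE turns a repeated insert of the same text into a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO param2_lists (canonical) VALUES (?)", (text,)
    )
    (list_id,) = conn.execute(
        "SELECT param2_id FROM param2_lists WHERE canonical = ?", (text,)
    ).fetchone()
    return list_id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE param2_lists ("
    "param2_id INTEGER PRIMARY KEY, canonical TEXT NOT NULL UNIQUE)"
)
run1 = get_or_create_id(conn, [0.1, 0.3, 0.5])
run2 = get_or_create_id(conn, [0.5, 0.1, 0.3])
print(run1 == run2)
```

The jobs table would then store `param2_id` exactly as in the question, and the list itself can be recovered by parsing the text column.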

Warning: it's a low-engineering approach, and perhaps even "not-the-right-way (TM)," but if it gets the job done....

Julius Musseau

In general, don't use lists, unless you have a very unusual set of requirements, and enough hands-on experience to anticipate the consequences.

A many-to-many relationship contained in a junction table, with appropriate indexes, will perform just as well and be much easier to use. It's also more flexible.
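For the junction-table route, one possible way to ask "is this exact set already stored?" is a GROUP BY / HAVING query. The sketch below (Python with sqlite3, for illustration) assumes the question's param2 table with a composite key on (param2_id, value), so no list holds the same value twice, and it assumes exact floating-point equality is acceptable. The function name `find_list_id` is hypothetical.

```python
import sqlite3

def find_list_id(conn, values):
    """Return the id of the stored list equal to `values` as a set, else None."""
    values = sorted(set(float(v) for v in values))
    if not values:
        return None
    marks = ",".join("?" * len(values))
    # A list matches when it has exactly n rows and all n of them are in the
    # candidate set; "value IN (...)" evaluates to 0 or 1 in SQLite.
    sql = (
        "SELECT param2_id FROM param2 GROUP BY param2_id "
        f"HAVING COUNT(*) = ? AND SUM(value IN ({marks})) = ?"
    )
    row = conn.execute(sql, [len(values), *values, len(values)]).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE param2 (param2_id INTEGER, value REAL, "
    "PRIMARY KEY (param2_id, value))"
)
conn.executemany(
    "INSERT INTO param2 VALUES (?, ?)",
    [(1, 0.1), (1, 0.3), (1, 0.5),
     (2, 0.1), (2, 0.3),
     (3, 0.1), (3, 0.3), (3, 0.5), (3, 0.7)],
)
print(find_list_id(conn, [0.5, 0.3, 0.1]))
```

The two HAVING conditions together rule out both subsets and supersets of the candidate, which is why list 2 and list 3 in the demo data do not match {0.1, 0.3, 0.5}.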

Walter Mitty
  • Walter, a list is a valid value just like a set is, or a number is, or a piece of text is. The list [1,2,3] being equal to the list [1,2,3] but not equal to the list [1,3,2], is just the same concept as the set {1,2,3} being equal to the set {2,3,1}, but not to the set {}, and also the same concept as the number 2 being equal to the number 2, but not to the number 5. If the requirements are such that "set equality" or "list equality" is involved, well then the solution for the requirement is likely to involve set/list equality too, wouldn't you think? ... – Erwin Smout Dec 17 '11 at 18:02
  • ... The fact that SQL engines [typically] do not have decent "native" support for set and list types, and thus for set/list equality, and that therefore the user is in big trouble whenever he needs to use the concept, does not mean that the user should try to avoid it, does it? – Erwin Smout Dec 17 '11 at 18:03