1

Let's say we're using Django with a Postgres database.

I want to store a sequence of data like so:

  • Record 1: 1, 2, 3, 4, 5
  • Record 2: 7, 8, 2, 3, 1, 9, 6
  • Record 3: 4, 4, 3, 2

A couple points to note:

  • The sequence will always be one-dimensional.
  • A sequence can have redundant values.
  • A sequence can be of variable length.

So first thing; I want to store this information in the database. There are lots of ways I can accomplish this, so let's look into my querying requirements.

Let's say I have a query sequence 1, 2, 3. Now I want to identify the sequences that match this sequence. A match would meet one of the following cases:

  • Case A: The sequence contains the query 1, 2, 3 in that order.
    • In the example, Record 1 matches this.
  • Case B: The sequence contains the components of the query 1, 2, 3 in any order.
    • In the example, Record 1 and 2 match this.
  • Case C: The sequence contains some of the components of the query 1, 2, 3 in any order.
    • In the example, all records match this.

In a perfect world I'd like the results to be ranked such that:

  • Record 1 comes first (because it matches Case A, our highest-priority match)
  • Record 2 comes second (because it matches Case B, our mid-priority match)

  • Record 3 comes last (because it matches Case C, our low-priority match)

Can anyone recommend a method, library, or concept for storing this data in such as way that queries can be made relatively fast?

Bryant Makes Programs
  • 1,493
  • 2
  • 17
  • 39
  • Since you’re using a relational database, you probably want to normalize these records: have a RecordValue table that matches a record ID (which could be a foreign key to a Record table, but you may not need that) and index to a value. Then you can write these queries as selects against that table. They may not be trivial (e.g. you may need to select all record IDs where the count of unique values that are 1, 2, or 3 is 3), but they will be doable. – abarnert May 11 '18 at 17:08
  • You thinking something like a table with: `sequence_identifier`, `value`, `position`? With each value in each sequence being stored in its own record. That would probably work; I could query for matching values in sequential order and matching values in non-sequential order. And then case C could be managed by getting a count of matching results for each sequence. – Bryant Makes Programs May 11 '18 at 17:11
  • have a look on `from difflib import SequenceMatcher` may be that will help you somehow. That worked for me in this my situation: https://stackoverflow.com/q/47682491/8495108 – Chiefir May 11 '18 at 17:34

1 Answers1

1

If you are using django and postgres look into the ArrayField. They are queryable.

chaggy
  • 1,091
  • 1
  • 10
  • 18