

I'm working on a tokenization system to store sensitive data in a key-value table, but the default select order can be a weak link in the chain if it reproduces the order of the INSERT queries. I think it's much easier to understand with an example.

Suppose that we have this data:

FirstName: Bob
LastName: Smith
Yearly income: $56000

We associate a random token with every value:

58217 => Bob
79174 => Smith
37142 => $56000
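For illustration, a 5-digit token like the ones above could be generated with a cryptographically secure RNG (the `make_token` name is my own; the question doesn't specify how tokens are produced):

```python
import secrets

def make_token() -> int:
    """Return a random 5-digit token, e.g. 58217.

    Uses the secrets module so tokens are unpredictable;
    a sequential ID would leak the insertion order on its own.
    """
    return secrets.randbelow(90000) + 10000
```

In a real system you would also need to check each new token for collisions against the ones already issued.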

We have two tables on two different databases:
the first one has the columns "FirstName", "LastName" and "YearlyIncome", and we store in it a row with the values 58217, 79174 and 37142;
the second one has the columns "Token" and "PlainText", in which we store three records:

58217, "Bob"
79174, "Smith"
37142, "$56000"
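A minimal sketch of the split, using two separate in-memory SQLite databases (table and column names follow the example above; everything else is an assumption):

```python
import sqlite3

# Database 1: only tokens, no plaintext.
db1 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE Person (FirstName INT, LastName INT, YearlyIncome INT)")
db1.execute("INSERT INTO Person VALUES (58217, 79174, 37142)")

# Database 2: only the token -> plaintext mapping.
db2 = sqlite3.connect(":memory:")
db2.execute("CREATE TABLE Tokens (Token INT, PlainText TEXT)")
db2.executemany("INSERT INTO Tokens VALUES (?, ?)",
                [(58217, "Bob"), (79174, "Smith"), (37142, "$56000")])

# Reassembling the record requires access to BOTH databases:
fn, ln, inc = db1.execute("SELECT * FROM Person").fetchone()
lookup = dict(db2.execute("SELECT Token, PlainText FROM Tokens"))
print(lookup[fn], lookup[ln], lookup[inc])  # Bob Smith $56000
```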

The purpose of tokenization is that an intruder who can read just one database cannot extract any useful information from it.

BUT: if a simple SELECT * FROM Tokens gives the records in the same order we inserted them, it's really simple to see that Bob Smith has a yearly income of $56000. Just read from top to bottom.
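To see the leak concretely (SQLite is used here only as an illustration; as the comments note, no engine guarantees this order, which is exactly the problem): a plain scan of an ordinary table walks the implicit rowid B-tree, so with sequential inserts and no deletes the rows typically come back in insertion order.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Tokens (Token INT, PlainText TEXT)")
db.executemany("INSERT INTO Tokens VALUES (?, ?)",
               [(58217, "Bob"), (79174, "Smith"), (37142, "$56000")])

# No ORDER BY: the scan follows the implicit rowid, i.e. insertion order,
# so the rows line up with the columns of the other database.
rows = db.execute("SELECT PlainText FROM Tokens").fetchall()
print(rows)  # [('Bob',), ('Smith',), ('$56000',)]
```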

I know that SQL doesn't guarantee that the default select order reflects the insert order, but it very often does, and in this context that's a problem. So the question is:
is there a way to shuffle the default order or - better - to shuffle how the data is physically stored in the database?

I know that we could dump all the data, shuffle it and re-insert it in the new order, but I think that's far too expensive a way to do it; I'm sure there is already a smooth solution, but I couldn't find anything useful on the Internet, so here I am :)

Thank you very much!

SpecialFx
    There is no "default" ordering. It's *non-deterministic*, which *isn't* to say that it's random, but is to say that you cannot assume any ordering. The only thing that *guarantees* order is the `ORDER BY` clause during retrieval. – Damien_The_Unbeliever Mar 20 '19 at 09:41
    I removed all dbms tags, please add the tag for the dbms you are actually using. – Joakim Danielson Mar 20 '19 at 09:52
  • So if the intruder figures out how to join properly, he still sees all the data? – GuidoG Mar 20 '19 at 09:54
  • "if an intruder can read just one database" - usually the step from zero to one is a lot more difficult than the step from one to two. If you assume the intruder can access one DB, you almost certainly should assume they can quickly get to two. Especially if the break comes from e.g. an application that has credentials to access both. – Damien_The_Unbeliever Mar 20 '19 at 10:26
  • @GuidoG: yes. Blame GDPR, not me :D To be honest, there is a clear advantage: you centralize the security efforts. If you have a high security standard where you store the token/value pairs, then you have a high security standard for the whole system (I know that's not entirely true, but in a real-world scenario, when you don't have the resources to adopt high security standards for every application, it's a lot better). To better explain: access to the token/value database goes through an intermediate service we are working on; there is no direct database access. – SpecialFx Mar 20 '19 at 12:26

1 Answer


Some RDBMS solutions have the concept of "clustered index". A clustered index stores records in the order of the index, so if your "token" column is truly random, creating a clustered index on that column will probably achieve what you want. Though, as @damien_the_unbeliever writes, the behaviour is non-deterministic, so there's no guarantee.
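As an illustration: SQLite has no `CREATE CLUSTERED INDEX` (in SQL Server that would be e.g. `CREATE CLUSTERED INDEX IX_Tokens ON Tokens(Token)`), but its `WITHOUT ROWID` tables behave the same way - the table itself is a B-tree keyed on the primary key, so rows are physically stored in token order and a plain scan no longer reflects the insertion order:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The table itself is a B-tree keyed on Token -- the clustered-index idea.
db.execute("CREATE TABLE Tokens (Token INT PRIMARY KEY, PlainText TEXT) "
           "WITHOUT ROWID")
db.executemany("INSERT INTO Tokens VALUES (?, ?)",
               [(58217, "Bob"), (79174, "Smith"), (37142, "$56000")])

# A full scan walks the primary-key B-tree, so the insertion order is gone:
rows = db.execute("SELECT Token, PlainText FROM Tokens").fetchall()
print(rows)  # [(37142, '$56000'), (58217, 'Bob'), (79174, 'Smith')]
```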

You should be aware that this creates possible performance challenges - by re-ordering the table every time you insert a new record, you may experience a noticeable slowdown.

I would also question whether the significant additional work and complexity you're introducing really give you any security benefit. By definition, there must be a way of combining data from the two databases; I imagine that's because you're creating a SQL user you are "confident" cannot be hacked - and if you are that confident, why not store the data in that user's database? You might also look at database-native encryption.

Finally, "tokenizing" has a specific meaning in computer science, and that's apparently not how you are using it.

Neville Kuyt
    Yeah, even with just a clustered index and no other indexes you can end up with carousel scans and other things that mean results aren't delivered in clustered index order. – Damien_The_Unbeliever Mar 20 '19 at 09:58
  • @Damien_The_Unbeliever . . . I don't think that really matters. If the clustering key is random, then returning the rows in another order will result in . . . randomness. The big issue is the fragmentation that such an approach entails. – Gordon Linoff Mar 20 '19 at 10:56
  • Thank you all for your replies. First of all, let me give you a little more context. We are working on an implementation to store data in a way compliant with the European GDPR, which encourages the use of encryption / tokenization (I agree that "tokenization" isn't the correct term, but it's the one chosen by those who wrote the GDPR and is now widely used with this meaning in this context. Just Google "GDPR tokenization"). You can read a short introduction here: https://medium.com/@alexewerlof/gdpr-pseudonymization-techniques-62f7b3b46a56 – SpecialFx Mar 20 '19 at 11:57
  • I also agree that tokenization is not the final solution for protecting sensitive data, because it's useless against an attack at the application level; it should be considered a mitigation that is effective against only some attacks, like a data leak from a single database. To better explain: in our solution the application has a direct connection only to one database, the first one; access to the second one (the one with the token / plaintext values) is mediated by a service that adds checks to verify that the use is legitimate. – SpecialFx Mar 20 '19 at 12:09
  • Anyway: clustered index seems what we are looking for, thank you very much. :) – SpecialFx Mar 20 '19 at 12:10