
I am a novice when it comes to database design and stumbled across the following problem (for the scientific database findstat.org):

I have a list of some hundreds of "combinatorial maps" (each identified by an integer), each of which is specified by a list of some tens of thousands of preimage-image pairs, both of which are just integers. The preimages are in fact consecutive integers.

(EDIT) The table is only read and written as a whole. There is never a need to access only a few values. More precisely, the "natural" unit is the data associated with one "combinatorial map". (END OF EDIT)

The current design, which sort of works, is as follows:

The table has (essentially) two columns: the "ID of the combinatorial map" (an INT) and "values" (a LONGTEXT) containing the images of 1, 2, ..., separated by "|".
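For illustration, the current design corresponds roughly to the following SQLAlchemy model (the table and column names here are placeholders, not the actual findstat.org schema):

```python
from sqlalchemy import Column, Integer
from sqlalchemy.dialects.mysql import LONGTEXT
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MapValues(Base):
    # one row per combinatorial map; all images packed into a single string
    __tablename__ = 'map_values'
    map_id = Column(Integer, primary_key=True)
    values = Column(LONGTEXT)  # e.g. "3|1|4|1|5|..." in preimage order
```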

I was told that this is not really proper design, so I tried the following variation. The new table has three columns: the "ID of the combinatorial map", "preimage" and "image".

This new table thus has many many many rows.
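Again for illustration, the new design might look as follows (using the SMALLINT/MEDIUMINT column types suggested in the comments below; the names are again placeholders). The composite primary key on (map_id, preimage) keeps the rows of one map together in InnoDB's clustered index:

```python
from sqlalchemy import Column
from sqlalchemy.dialects.mysql import SMALLINT, MEDIUMINT
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class MapEntry(Base):
    # one row per preimage-image pair
    __tablename__ = 'map_entries'
    map_id = Column(SMALLINT, primary_key=True)
    preimage = Column(MEDIUMINT, primary_key=True)
    image = Column(MEDIUMINT)
```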

QUESTIONS:

  • Looking at the memory consumption, it appears that the new table uses much more memory. Is this to be expected? (EDIT: as it turns out, I am roughly doubling the memory consumption. This was rather surprising for me, because in the new design I store integers as INT, whereas in the old design I store integers as text. A blog post by Mark Callaghan from 2009 mentions that the overhead is roughly 20 bytes per row, which is about 24 MB for a table of 1 250 000 rows. I would have expected the overhead of storing integers as text to be much larger, but I was mistaken. END OF EDIT)

  • I populate this table, starting from an empty table, using python and sqlalchemy. For my application, it seems right to first add all rows and then commit. While this was unproblematic with the first design, with the new design my computer "ran out of memory" - that is, it started swapping until it became completely unresponsive. Is it likely that I did something wrong, or is the "new" design simply bad?
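For what it's worth, one way to avoid this kind of memory blow-up is to keep the single commit but bypass the ORM's identity map, sending the rows to the server in executemany batches through the Core. A minimal sketch, assuming the MapEntry class sketched above and a hypothetical generator `pairs(map_id)` yielding (preimage, image) tuples:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://...')  # connection URL elided
session = sessionmaker(bind=engine)()

CHUNK = 10000
buffer = []
for preimage, image in pairs(map_id):  # hypothetical data source
    buffer.append({'map_id': map_id, 'preimage': preimage, 'image': image})
    if len(buffer) >= CHUNK:
        # executemany: rows go to the server instead of piling up in Python
        session.execute(MapEntry.__table__.insert(), buffer)
        buffer = []
if buffer:
    session.execute(MapEntry.__table__.insert(), buffer)
session.commit()  # still one transaction, so the load is all-or-nothing
```

The pending rows then accumulate in the server's transaction machinery rather than in the Python process, which is usually what causes the swapping described above.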

  • Partial answer: `INT` always takes 4 bytes (probably plus 1 byte overhead in InnoDB). A number in a |-delimited string takes 2-12 bytes: the string for the number, plus the `|`. – Rick James Nov 02 '15 at 20:29
  • Indeed. Most images will have 5 digits, so the "bad" design needs just 6 bytes per image, whereas the "good" design needs 20 bytes for the row and something like 8 bytes for the ID, the preimage and the image (if I use smallints and mediumints). Not so surprising after all... – Martin Rubey Nov 03 '15 at 07:30
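A back-of-envelope check of the numbers from these two comments (the 20-byte figure is the rough per-row overhead cited above; real InnoDB sizes additionally depend on page fill and secondary indexes):

```python
pairs = 1250000                 # total preimage-image pairs
old = pairs * (5 + 1)           # ~5 digits per image plus one '|' separator
new = pairs * (20 + 2 + 3 + 3)  # row overhead + SMALLINT + 2 x MEDIUMINT
print(old // 10**6, new // 10**6)  # roughly 7 MB vs. 35 MB of raw row data
```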

1 Answer


The new design is the proper one: each column of an entity holds a single value, not a set of values. Your "|"-separated string falls into the category of comma-separated 'values' stuffed into a single column (as in your prior design), and that is bad. By "new design" I mean the one with the many rows, the one whose size you are concerned about.

Having more rows naturally incurs more storage. But keep in mind that with the correct index strategy your data will be accessible much faster, and this is the kind of design at which the SQL engine excels.
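For example, with the composite primary key on (map_id, preimage) sketched in the question, fetching everything for one map is a single range scan over InnoDB's clustered index. A sketch, reusing the MapEntry class and session from above:

```python
# all pairs of map 42, in preimage order; InnoDB reads them contiguously
# because (map_id, preimage) is the clustered key
rows = (session.query(MapEntry.preimage, MapEntry.image)
               .filter(MapEntry.map_id == 42)
               .order_by(MapEntry.preimage)
               .all())
```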

Developers often think they are doing the server a favor by using comma-separated lists. They are not. Not by a long shot.

I see no reason to create a transaction, insert hundreds of thousands of rows, and then commit. Just populate the table for this pump-priming action, then check whether you are satisfied with the row counts and whatever other sanity checks you can perform.
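Such a sanity check could be as simple as comparing per-map row counts against the expected sizes (a sketch; `expected_size(map_id)` is a hypothetical helper):

```python
from sqlalchemy import func

# one (map_id, row count) pair per combinatorial map
counts = (session.query(MapEntry.map_id, func.count())
                 .group_by(MapEntry.map_id)
                 .all())
for map_id, n in counts:
    assert n == expected_size(map_id), "map %s is incomplete" % map_id
```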

Drew
  • The reason for emptying the table, inserting everything and then committing is that during the whole process (which may fail), I want other users to still be able to access the old version of the table. I guess a workaround may be to give the new table a different name and rename it after success. Does this seem right? (There is a sketch of this after the comments.) – Martin Rubey Oct 25 '15 at 10:33
  • I also wonder *why* having more rows increases storage size - the amount of data is the same, after all. How can I minimize the overhead each row incurs? – Martin Rubey Oct 25 '15 at 10:37
  • It is the overhead for setting up each new row: rows live on data pages, after all, and indexes have fill-factor slots. Consider it overhead, stats, pointers. There is simply ZERO benefit to CSV in tables like your original design if one wants to access the data quickly, modify it, and have a sane approach to using sql, unless find_in_set and parsing techniques excite one – Drew Oct 25 '15 at 14:21
  • I am sorry to say that I am for performance, not for narrowly focusing on "how can I minimize the overhead each row incurs". I like to write answers and give tips on making the programmer's (and the server's) job easier, [like this one](http://stackoverflow.com/a/32620163). I am not a big fan of slow table scans – Drew Oct 25 '15 at 14:24
  • as for your first question, sure, create another table. Keep in mind that the transaction creates a memory concern and logging overhead, all of which is unnecessary if you know the metrics at the end to verify completeness. You are just loading a table, after all (and I would do [LOAD DATA INFILE](http://stackoverflow.com/a/33307547)). For day-in and day-out transactional statements, sure, those are a must. But that is not what you are doing. And you are experiencing the problems you described by wrapping such a massive operation in a transaction. – Drew Oct 25 '15 at 14:33
  • I'm not totally convinced yet... Possibly I shouldn't store these data in a database table at all. In fact, I do not understand your comment concerning "LOAD DATA INFILE". In my application, the values of the table are computed once (which is time-consuming), and stored. They are never modified. Whenever they are retrieved, I have to retrieve all of them at once... – Martin Rubey Oct 28 '15 at 14:39
  • I was referring to your phrase `I populate this table starting with the empty table using python and sqlalchemy. For my application, it seems right to first add all rows, and then commit.` ... in reference to load data infile – Drew Oct 28 '15 at 16:04
  • but I agree. We sometimes have to ask whether the data belongs in a db at all. Good luck! – Drew Oct 28 '15 at 17:48
  • Oh, sorry, yes, that was misleading. Could you suggest an alternative to a database table? – Martin Rubey Oct 29 '15 at 09:18
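Tying together the rename workaround from the first comment and the LOAD DATA INFILE suggestion: a hedged sketch of the whole load-and-swap cycle, assuming the `map_entries` table, `engine`, and hypothetical `pairs(map_id)` generator from above (all other names are placeholders). MySQL's multi-table RENAME TABLE is atomic, so readers see either the old data or the new, never a mix:

```python
import csv
import tempfile

# 1. dump the computed pairs to a temporary CSV file
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    writer = csv.writer(f)
    for preimage, image in pairs(map_id):  # hypothetical data source
        writer.writerow([map_id, preimage, image])
    path = f.name

with engine.begin() as conn:
    # 2. bulk-load into an identically structured staging table
    #    (requires the local_infile option to be enabled)
    conn.execute("CREATE TABLE IF NOT EXISTS map_entries_staging LIKE map_entries")
    conn.execute("LOAD DATA LOCAL INFILE '%s' INTO TABLE map_entries_staging "
                 "FIELDS TERMINATED BY ','" % path)

with engine.begin() as conn:
    # 3. atomic swap, then discard the old version
    conn.execute("RENAME TABLE map_entries TO map_entries_old, "
                 "map_entries_staging TO map_entries")
    conn.execute("DROP TABLE map_entries_old")
```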