
I can't seem to find any examples of anyone doing this on the web, so I'm wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?

To save on database storage space for regularly recurring strings, I'm thinking of creating a MySQL table called unique_string. It would have only two columns:

  1. "id" : INT : PRIMARY_KEY index
  2. "string" : varchar(255) : UNIQUE index

Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example, a VARCHAR field called browser would instead be an INT field called browser_unique_string_id.
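To illustrate, here's a rough sketch of the schema I have in mind (the page_request table and its column names are just hypothetical examples):

CREATE TABLE unique_string (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    string VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_string (string)
);

-- A hypothetical logging table that stores ids instead of raw strings
CREATE TABLE page_request (
    id                       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    requested_at             DATETIME NOT NULL,
    url_unique_string_id     INT UNSIGNED NOT NULL,
    browser_unique_string_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (id)
);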

I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and an "audit trail" of user actions on intranets, but potentially other things too.

I'm also aware that the SELECT queries would be more complex, but I'm not worried about that. I'll most likely write some code to generate the queries that return the "real" string data.
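For example, a generated query might look something like this (a sketch, using the hypothetical page_request table from the schema above):

SELECT r.requested_at,
       u.string AS url,
       b.string AS browser
FROM page_request r
JOIN unique_string u ON u.id = r.url_unique_string_id
JOIN unique_string b ON b.id = r.browser_unique_string_id;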

Thoughts? I feel like I might be overlooking something obvious here.

Thanks!

LaVache
  • I may be missing what you're looking to do, but what you're describing sounds exactly like the vanilla best practice of database normalization: http://stackoverflow.com/questions/723998/can-someone-please-give-an-example-of-1nf-2nf-and-3nf-in-plain-english – Alex K. May 22 '12 at 14:06
  • [Normalisation?](http://en.wikipedia.org/wiki/Database_normalization) - "Database normalization is the process of organizing the fields and tables of a relational database to **minimize redundancy** and dependency." - From Wikipedia – Mr Moose May 22 '12 at 14:07
  • From what I understand, you want to store the varchar in a table with an id, and then work with the ID instead of the varchar type, right? – jcho360 May 22 '12 at 14:09
  • Yeah I guess the reason I'm asking is that I haven't seen any examples of having one main string table that is referenced across the whole database for absolutely anything that is a string. Generally there'll be multiple tables that are specifically a list of "somethings", such as a table of job titles. – LaVache May 22 '12 at 14:14
  • You might be interested in the ARCHIVE storage engine - http://dev.mysql.com/doc/refman/5.5/en/archive-storage-engine.html - it compresses data and I bet zlib is going to do a pretty good job compared to what you might do in SQL queries WRT compression. – Andrew May 22 '12 at 14:49
  • Thanks Andrew, very interesting! I'd never heard of this before. I think I'll give it a go; sounds perfect for the web request logging part. Will make things much simpler. – LaVache May 22 '12 at 15:03
  • @AlexK.: Replacing strings with surrogate id numbers has nothing to do with normalization. – Mike Sherrill 'Cat Recall' May 23 '12 at 07:44

3 Answers


I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.

The performance issues are not trivial. As the database grows, there are tens of millions of URIs. So just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in Hadoop, so the database table was, in essence, just a copy of a Hadoop table.

In a regular database, you would get around this by building an index, as you suggest in your question. An index solution works well up to the limits of your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know whether MySQL or SQL Server recognizes this, although columnar databases (such as Vertica) should.
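As an illustration of the "you only need the index" point: with the UNIQUE index from the question in place, a lookup that selects only the id can typically be satisfied from the index alone (a covering index), without touching the table rows. A sketch, assuming the unique_string table described in the question:

-- In MySQL/InnoDB the secondary UNIQUE index on string also contains the
-- primary key, so this lookup can be answered from the index alone;
-- EXPLAIN should report "Using index".
EXPLAIN SELECT id FROM unique_string WHERE string = 'Mozilla/5.0 (Windows NT 6.1)';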

SQL Server has another option. If you declare the string as VARCHAR(MAX), then it is stored on a separate data page from the rest of the data. During a full table scan, there is no need to load the additional pages into memory if the column is not referenced in the query.

Gordon Linoff
  • Thanks, yeah interesting point on things going south when the index gets large. For this project I think it should be ok, as there's only about 10 URLs and a limited set of browsers etc. I'll only use it for regularly reoccurring strings. – LaVache May 22 '12 at 14:55

This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table it's linked to. The queries wouldn't be very complex: just a simple join to the lookup table. You can also include more than just a string in the lookup table, such as other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.

Example:

Request Table:

Date    
Time   
IP Address    
Browser_ID  

Browser Table:

Browser_ID
Browser_Name
Browser_Version
Browser_Properties
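
In SQL that might look like the following (a sketch; the types and lengths are just illustrative):

CREATE TABLE browser (
    browser_id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    browser_name       VARCHAR(100) NOT NULL,
    browser_version    VARCHAR(50)  NOT NULL,
    browser_properties VARCHAR(255),
    PRIMARY KEY (browser_id)
);

CREATE TABLE request (
    request_date DATE NOT NULL,
    request_time TIME NOT NULL,
    ip_address   VARCHAR(45) NOT NULL, -- long enough for an IPv6 address in text form
    browser_id   INT UNSIGNED NOT NULL,
    FOREIGN KEY (browser_id) REFERENCES browser (browser_id)
);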
N West
  • Thanks for your response. I think I could have been more specific with my question. I'm asking more about having one master table of strings for any kind of data at all, that is used across the whole DB rather than having multiple string tables for specific entities. – LaVache May 22 '12 at 14:42
  • Replacing text with an id number has nothing to do with normalization. And you're not removing duplicate data; you're replacing repeated text with repeated numbers. – Mike Sherrill 'Cat Recall' May 23 '12 at 07:49
  • If you have "Browser Name, Browser Version" in your main table, and replace it with a foreign key on the main table with a 1-M relationship to the "Browser" table, you've removed duplicate data from the filesystem and replaced it with one key. *Foreign Keys* are an essential part of normalization. http://en.wikipedia.org/wiki/Foreign_key – N West May 23 '12 at 14:37

If you're planning on logging data in real time (as opposed to in a batch job), then you want to ensure the time it takes to write a record to the database is as short as possible. If you are logging synchronously, then obviously the record creation time will directly affect the time it takes for an HTTP request to complete. If this is async, then slow record creation times will lead to a bottleneck. However, if this is a batch job, then performance will not matter so long as you can confidently create all the batched records before the next batch runs.

In order to reduce the time it takes to create a record, you really want to flatten out your database structure. Your current query, in pseudocode, might look like:

SELECT @id = id FROM PagesTable
WHERE PageName = @RequestedPageName

IF @id IS NULL
THEN
  INSERT INTO PagesTable (PageName) VALUES (@RequestedPageName)
  SET @id = @@IDENTITY -- or whatever method your db supports for
                       -- fetching the id of a newly created record
END IF

INSERT INTO BrowserLogTable (PageId, Browser) VALUES (@id, @BrowserName)

Whereas in a flat structure you would just need one INSERT.
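
If you do keep the lookup table, MySQL can at least collapse the check-then-insert into a fixed pair of statements. A sketch, assuming the unique_string table from the question; the request_log table and its columns are hypothetical:

-- Insert the string only if it is not already present; the UNIQUE
-- index on string makes a duplicate insert a no-op.
INSERT IGNORE INTO unique_string (string) VALUES (@RequestedPageName);

-- Then fetch its id with one indexed lookup.
SELECT id INTO @id FROM unique_string WHERE string = @RequestedPageName;

INSERT INTO request_log (page_id, browser_id) VALUES (@id, @BrowserId);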

If you are concerned about data integrity, which you should be, then typically you would normalise this data by querying it and writing it into a separate set of tables (or a separate database) at regular intervals, and use those for querying against.

Jaimal Chohan
  • Awesome, thanks. This is exactly the kind of thing I hadn't thought of (the performance during log creation that is). In my case it should be ok. Regardless of the number of "unique_string" fields in any table, I can pull out all the IDs with one initial SELECT query before the INSERT (assuming the values have already been used in the past, which they will have mostly). If my site ever gets busy enough that this slows things down, that would be an awesome problem to have. =) – LaVache May 22 '12 at 14:39