
Right now we have a database table (SQL Server 2008 R2) that stores an uploaded file (PDF, DOC, TXT, etc.) in an image type column. A user uploads this file from an ASP.NET application. My project is to get a handle on the rate at which this table is growing, and I've come up with a couple of questions along the way.

  1. On the database side, I've discovered the image column type is supposedly deprecated. Will I gain any benefit by switching over to varbinary(max)? Should I use varbinary(5767168) because that is my file size cap, or might I just as well leave it as an image type as far as space efficiency is concerned?

  2. On the application side, I want to compress the byte array. Microsoft's built-in GZip sometimes made the file bigger instead of smaller. I switched over to SharpZipLib, which is much better, but I still occasionally run into the same problem. Is there a way to find out the average file compression savings before I implement it on a wide scale? I'm having a hard time finding out what underlying algorithm it uses.

  3. Would it be worth writing a Huffman coding algorithm of my own, or will that present the same problem, where the compressed file is occasionally larger than the original?

For reference, in case it matters, here's the code in my app:

    using ICSharpCode.SharpZipLib.GZip;

    // Compresses a byte array using SharpZipLib's GZip stream.
    private static byte[] Compress(byte[] data)
    {
        MemoryStream output = new MemoryStream();

        using (GZipOutputStream gzip = new GZipOutputStream(output))
        {
            // Leave the underlying MemoryStream open so it can be read
            // after the GZip stream is disposed.
            gzip.IsStreamOwner = false;
            gzip.Write(data, 0, data.Length);
        }

        return output.ToArray();
    }

    // Decompresses a GZip-compressed byte array back to the original bytes.
    private static byte[] Decompress(byte[] data)
    {
        MemoryStream output = new MemoryStream();
        MemoryStream input = new MemoryStream(data);

        using (GZipInputStream gzip = new GZipInputStream(input))
        {
            // Copy the decompressed stream into the output buffer in chunks.
            byte[] buff = new byte[4096];
            int read = gzip.Read(buff, 0, buff.Length);

            while (read > 0)
            {
                output.Write(buff, 0, read);
                read = gzip.Read(buff, 0, buff.Length);
            }
        }

        return output.ToArray();
    }

Thanks in advance for any help. :)

Brandi
  • Haven't worked enough with SQL Server/this type of problem, but I've heard that often the resources aren't persisted into the database itself (especially for things like large files), only references to the location on the webserver. – Clockwork-Muse Nov 02 '12 at 21:03
  • You may want to consider having a separate column for the compression algorithm. That way you can use different algorithms over time, for different file types, or note that no compression was used if it would only cause the data to grow. – HABO Nov 03 '12 at 00:20
  • Clockwork-Muse, there is a risk of broken links if the files were moved at some point in the future. Nothing connects the link and the files besides one programmer who happens to know that those specific files should not be moved. Probably okay for some stuff, but in my case these are sensitive legal docs, so for better or worse, we want to ensure they are there to stay. Given this constraint, compression is a way to mitigate the size explosion a little bit. Thanks anyway for the suggestion! – Brandi Nov 06 '12 at 19:47

2 Answers


That's not really a byte array on the database side; it's a BLOB. Ten years ago, you would have used the IMAGE datatype.

These days it's more efficient to use VARBINARY(MAX). I really recommend using FILESTREAM with VARBINARY(MAX), since it makes backing up the database (without the blobs) quite easy.

Keep in mind that storing the files in their native formats (without compression) allows full-text search, which is pretty incredible if you think about it. You have to install an iFilter from Adobe to search inside PDFs, but it's a killer feature; I can't live without it.
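
For reference, a minimal sketch of a FILESTREAM-enabled table (the table and column names are placeholders, and it assumes FILESTREAM is already enabled on the instance and the database has a default FILESTREAM filegroup):

    using System.Data.SqlClient;

    // Sketch only: creates a table whose BLOB column is stored via FILESTREAM.
    // dbo.Documents and its columns are placeholder names.
    private static void CreateDocumentTable(string connectionString)
    {
        const string ddl = @"
            CREATE TABLE dbo.Documents
            (
                DocumentId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
                FileName   NVARCHAR(260)    NOT NULL,
                Content    VARBINARY(MAX)   FILESTREAM NULL
            );";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(ddl, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }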

Aaron Kempf
  • Right! It's a Byte Array in my application. :) Do you have any sources showing that the varbinary(max) type is more efficient than the image type? Also, do you mean varbinary is more space-efficient, more time-efficient, or both? It is a pretty sweet feature to search within it, although I don't need the feature, and I doubt it'd work with the data once it was compressed anyway. – Brandi Nov 02 '12 at 21:27
  • Isn't filestream data backed up just like regular data? It has to be, for consistency. – usr Nov 02 '12 at 22:16
  • Filestream objects can be included in or excluded from the backup. I know it doesn't sound like that big of a deal, but I've worked with too many blobs to ALWAYS want to back them up. – Aaron Kempf Nov 02 '12 at 22:28
  • Also, I -think- that you -might- be able to use NTFS compression for this and still support full-text search. I'd at least TRY that before assuming it's not an option. – Aaron Kempf Nov 02 '12 at 22:29
  • It looks like I'm wrong on that. I remember in SQL 2008 I was DYING to get compression working, and I thought they were going to offer it in a future version. I haven't checked whether that is available in SQL 2012 yet. – Aaron Kempf Nov 02 '12 at 22:31
  • Here is a situation where someone claims to get a 3:1 compression ratio using Standard Edition and NTFS compression: http://www.codewrecks.com/blog/index.php/2011/05/24/compress-data-in-sql-server-2008-standard-thanks-to-filestream-and-nhibernate/ – Aaron Kempf Nov 02 '12 at 22:41
  • I think you can skip backing up filestream data, but that makes it a partial backup, which is worthless by itself. To restore you need a recent copy of the filestream data. I cannot imagine SQL Server allowing data to become inconsistent. – usr Nov 02 '12 at 23:35
  • I think the NORMAL backup definitely includes FileStream objects; that's the default. I was really looking for the argument on the BACKUP command, but I wasn't able to find it. – Aaron Kempf Nov 02 '12 at 23:38

I hate to be a jerk and answer my own question, but I thought I'd summarize my findings into a complete answer for anyone else looking to space-efficiently store file/image data within a database:

* Using varbinary(MAX) versus Image?

There are many reasons for using varbinary(MAX), but top among them is that image is deprecated and will be removed altogether in a future version of SQL Server. Not starting any new projects with it just nips a future problem in the bud.

According to the info in this question: SQL Server table structure for storing a large number of images, varbinary(MAX) has more operations available to be used on it.
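
For an existing table, the switch itself can be a single ALTER COLUMN. A rough sketch (the table and column names below are placeholders, not the real schema):

    using System.Data.SqlClient;

    // Sketch only: converts a legacy image column to varbinary(max).
    // dbo.UploadedFiles and FileData are placeholder names.
    private static void ConvertImageColumnToVarbinaryMax(string connectionString)
    {
        const string sql =
            "ALTER TABLE dbo.UploadedFiles ALTER COLUMN FileData VARBINARY(MAX) NULL";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }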

Varbinary(MAX) is easy to stream from a .NET application by using a SqlParameter; passing -1 for the length means 'MAX'. Like so:

SQLCommand1.Parameters.Add("@binaryValue", SqlDbType.VarBinary, -1).Value = compressedBytes;
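
In context, the full insert looks roughly like this (dbo.UploadedFiles and its columns are placeholder names; compressedBytes is the output of the Compress method from the question):

    using System.Data;
    using System.Data.SqlClient;

    // Sketch only: inserts a compressed file into a placeholder table
    // dbo.UploadedFiles (FileName nvarchar(260), FileData varbinary(max)).
    private static void SaveCompressedFile(string connectionString, string fileName, byte[] compressedBytes)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO dbo.UploadedFiles (FileName, FileData) VALUES (@fileName, @binaryValue)", conn))
        {
            cmd.Parameters.Add("@fileName", SqlDbType.NVarChar, 260).Value = fileName;

            // -1 maps the parameter to varbinary(max).
            cmd.Parameters.Add("@binaryValue", SqlDbType.VarBinary, -1).Value = compressedBytes;

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }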

* What compression algorithm to use?

I'm really not much closer to a definitive answer on this one. I used ICSharpCode.SharpZipLib.GZip and found it performed better than the built-in compression classes, simply by running both on a bunch of files and comparing the results.

My results:

I reduced my total file size by about 20%. Unfortunately, a lot of the files I had were PDFs, which don't compress that well, but there was still some benefit. Not much luck (obviously) with file types that were already compressed.
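
For anyone wanting to run the same kind of check before committing to compression, the comparison can be as simple as something like this (the folder path is a placeholder, and Compress is the method from the question):

    using System;
    using System.IO;

    // Sketch only: compresses every file in a sample folder and reports the
    // overall savings, flagging files that compression actually made bigger.
    private static void ReportCompressionSavings(string sampleFolder)
    {
        long originalTotal = 0;
        long compressedTotal = 0;

        foreach (string path in Directory.GetFiles(sampleFolder))
        {
            byte[] original = File.ReadAllBytes(path);
            byte[] compressed = Compress(original);

            originalTotal += original.Length;
            compressedTotal += compressed.Length;

            if (compressed.Length >= original.Length)
            {
                Console.WriteLine("No gain: " + Path.GetFileName(path));
            }
        }

        Console.WriteLine("Original:   {0:N0} bytes", originalTotal);
        Console.WriteLine("Compressed: {0:N0} bytes", compressedTotal);
        Console.WriteLine("Savings:    {0:P1}", 1.0 - (double)compressedTotal / originalTotal);
    }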

Brandi