0

While the question heading is complete in itself, I'll provide some meta info as to how I hit this problem.


With SQL Server and MySQL it was all good and fine with any kind of and any size of data but recently I started working with a DBMS that allows only a max of 64 KB of data in each column of a table. The limit of 64 KB is problematic when there is binary data in the columns. Other examples are - a column containing an image or an audio or a multimedia object or geospatial data.

You won't always get images of size less than 64 KB. BLOBs can simply not be stored in such DBMS. The DBMS allows users to add to its capabilities by writing functions in C/C++ and calling them. But with even these functions we can return a max of 64 KB data per call.

Somebody suggested me a workaround -

store the binary data on file system and store an identifier to that particular binary data in the table's column. This way, when anyone requests that data, the data can be sent back to the caller in chunks of 64 KB size each.

But then the second limitation came up, which is that the data can be sent from the DBMS as one of the following datatypes only - char / varchar / integer / smallint / bigint / boolean / real. The most favoured candidate in this list was varchar(max) for obvious reasons (that it's the longest).

So, can a binary file be read and sent as text? The client end which talks with DBMS is being developed in C#.

displayName
  • 13,888
  • 8
  • 60
  • 75

1 Answers1

4

There is only one problem with reading a binary file chunk as text and then sending that text. That problem is that the converted text does not return that same binary from which this text was created.

It's straightforward enough to understand without providing links that at a sufficiently low level of abstraction, all files are "binary" in the sense that they just contain a bunch of numbers encoded in binary form.

However, it's VERY important to distinguish between

  • text files, where all the numbers can be interpreted as characters representing human-readable text, and
  • binary files, which contain data that, if interpreted as characters, yields non-printable characters.1

So, we may read a binary file which may contain the byte combination for producing a sound and because it is non-printable, it will be irrecoverably lost. There are byte combinations for moving the cursor position. No sooner you hit one, you lose the positioning.

Not only any Control Character will be lost when converted to text, the entire printable text would also not be in place, resulting in garble out of your binary file.

So, binary to text to binary conversion is lossy.


Alternative?

The most commonly found alternative is to convert the binary data to its hexadecimal representation and then send back this string equivalent of binary data. Now, the size of a byte is obviously 1 byte. The size of 1 char is 1 byte too but it takes 2 chars to represent 1 byte in hex form. In other words, you get 2 GB of text when you want to retrieve a 1 GB binary file.

However, when taking this approach there is support from SoapHexBinary class which reduces the hexadecimal string to binary conversion down to one method call, demonstrated in this answer to How do you convert Byte Array to Hexadecimal String, and vice versa.

--------

Not proven to be better but another method is as follows:

  • Create a string stream (or some equivalent of it like a string builder).
  • Pick any char (it'll take 1 byte) as a delimiter.
  • Pick the biggest possible primitive numeric supported by the DBMS. Let its size be k bytes. In my DBMS's case it is BIGINT which is an integer of fixed size of 64-bits (8 bytes).
  • Read the binary file k bytes at a time and create the equivalent numeric value.
  • Insert the numeric value to the string stream, char delimited.
  • When the size of the stream/builder has reached to max value possible, return this string.

At the client end, extract the char delimited strings, convert them to the numeric value, get bytes from this value and concatenate these bytes to re-create the binary file.


1In ASCII, the first 32 characters are nonprinting control characters originally used to control the behavior of a Teletype machine, causing it to do such things as sound the bell, back up one character, move to a new line, and move the carriage to the beginning of the line. Of these 32 control characters, only three, the newline, carriage return, and horizontal tab, are typically found in text files.

Community
  • 1
  • 1
displayName
  • 13,888
  • 8
  • 60
  • 75
  • 1
    One thing you gloss over in alternative but is very widely used is Base64 encoding, that gives you a increase in size of 4/3 (~1.333x) vs. hex encoding's 2x. It also is reduceable to a single call via `Convert.ToBase64String` and `Convert.FromBase64String` – Scott Chamberlain Oct 28 '15 at 21:02