0

I am using Snappy-java to encode JSON data and I want to store the result in database in a varchar column.
The database is an oracle database with ISO-8859-1 encoding.
I am facing an encoding problem when inserting the data. It would seem that some characters are not recognised by Oracle.

I've found a workaround by using Base64 encoding on the compressed data before inserting it. I can then retrieve it just fine :)

The problem with that is that Base64 encoding increases the length of the data that I am then storing, hereby reducing the savings gained with Snappy...

So my question is: How can I store that data without encoding it in Base64? The reason I want to use a varchar is because I want to be able to access the table using an oracle index without ever accessing the table (performance is definitely an issue).

I have tried other compression algorithms as well, but they all seem to have the same problem. I have also looked at yEnc but I cannot find any java encoder. Moreover I am not sure that I understands all the problems listed with yEnc, so I am bit reluctant using it.

Thanks a lot for any help!

Arnaud Potier
  • 1,750
  • 16
  • 28
  • Are you trying to index a column that contains binary data? That sounds like there is something wrong with your design. You will either have to drop the compression, or the index. – W. Goeman Jul 27 '12 at 15:26

4 Answers4

1

You're compressing data. The result of compression is almost always binary data, not text... I assume that's the case with Snappy too.

To accurately store binary data as text you should use something like Base64. You don't have characters - you have bytes.

To store the compressed text efficiently, you should change your database column to a binary-oriented type rather than a text-oriented type (a BLOB rather than a CLOB, basically).

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
0

The char in question may be a control char for the oracle db. In any case, you may want to consider saving off the data as 'text' or a 'clob' rather than as a varchar.

Short of that, you may need to employ a zip or huffman type compressor for the data before you insert it. Have you played with something like this: How to compress a String in Java??

Community
  • 1
  • 1
Noah
  • 1,966
  • 1
  • 14
  • 29
0

Jon Skeet's answer to simply use a binary field instead of text field is the obvious solution.

If you really need to use a text field, then first experiment to see what you can store in it. My guess is that you can store any byte except the zero byte, where the latter is used to terminate the variable length string. You can simply store a length 255 string with the bytes 0x01..0xff in it, and the retrieve it to see if you get exactly those bytes back. If you do, then the only byte to avoid is a zero.

If my theory is correct, then there are several simple ways to get rid of the zeros with limited expansion of the data. The easiest is when you get to a 0, send a 0x80, 0x01. When you get a 0x80, send an 0x80, 0x81. Then when decoding, if you see a 0x80, just get the next byte and subtract 1. This will expand the stream by less than 1% (0.78%) on average.

You can do better with more sophisticated coding schemes, if absolutely minimizing the expansion matters.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
0

Thank you all for your help!

I finally found a workaround. Since I am storing bytes and not chars, I am going to use a BLOB to store the data. The problem with the BLOB is that it cannot be indexed. The alternative is using a RAW type column. It stores bytes and is indexable. Unfortunately it is too small (2000 bytes). So, the answer in my case consist in storing the data in BLOB, and access it through an index on two RAW types since the data is never bigger than 4000 bytes.

The index looks like this:

CREATE INDEX blob_to_raw_prd_ix 
ON product (product_id, 
            substr_dt(blob_summary,2000,1), 
            substr_dt(blob_summary,2000,2001));

where

  • blob_summary is the BLOB column i store the data in
  • substr_dt is a user defined deterministic function (defined hereafter)

    CREATE OR REPLACE FUNCTION substr_dt(str BLOB, buffer_size int, offset int) RETURN RAW DETERMINISTIC IS BEGIN RETURN dbms_lob.substr(str,buffer_size,offset); END;

To access the data, I just need to query the product_id and the fields using aliases, e.g.

SELECT     /*+ index(blob_to_raw_prd_ix) */ product_id, 
                                            substr_dt(blob_summary, 2000, 1) AS summary1, 
                                            substr_dt(blob_summary, 2000, 2001) AS summary2
FROM       product
WHERE      (product_id = ?);

In this case, summary_1 represents the first 2000 bytes of the blob, and summary 2 the last 2000 bytes. Using concatenation on the two arrays summary1 and summary2 I get the content of the blob.

That works with Jdbc but I could not make it work with Hibernate (yet). It is not the best solution ever as data needs reprocessing before being interpreted. However, it solves the data access problem without the encoding space overhead.

Arnaud Potier
  • 1,750
  • 16
  • 28