4

I'm working with an Oracle database and want to determine the length in bytes of a NCLOB using a multibyte charset (UTF-8).

LENGTHB() does not support CLOBs oder NCLOBS with multibyte charset. I could convert the NCLOB to a BLOB and get its length then. But isn't there a better way to do this?

Dave
  • 43
  • 2
  • 6

1 Answers1

4

Oracle stores CLOB in UTF-16 (or possibly your NCHARACTER_SET?). Each character is stored as two bytes.

Here's how you can see the raw data:

SQL> CREATE TABLE test_clob (c CLOB, v VARCHAR2(10 CHAR), nv NVARCHAR2(10));

Table created

SQL> INSERT INTO test_clob VALUES ('0123456789', '0123456789', '0123456789');

1 row inserted

SQL> SELECT dbms_rowid.rowid_relative_fno(ROWID),
  2         dbms_rowid.rowid_block_number(ROWID)
  3    FROM test_clob;

DBMS_ROWID.ROWID_RELATIVE_FNO( DBMS_ROWID.ROWID_BLOCK_NUMBER(
------------------------------ ------------------------------
                            13                          94314

SQL> alter system dump datafile 13 block 94314;

System altered

Navigate to your USER_DUMP_DEST directory and open the trace file, you should see something like this:

col  0: [56]
 00 54 00 01 02 0c 80 00 00 02 00 00 00 01 00 00 03 64 c6 a5 00 24 09 00 00
 00 00 00 00 14 00 00 00 00 00 01 00 30 00 31 00 32 00 33 00 34 00 35 00 36
 00 37 00 38 00 39
LOB
Locator:
  Length:        84(56)
  Version:        1
  Byte Length:    2
  LobID: 00.00.00.01.00.00.03.64.c6.a5
  Flags[ 0x02 0x0c 0x80 0x00 ]:
    Type: CLOB 
    [...]
    Inline data[20]
    [...]
col  1: [10]  30 31 32 33 34 35 36 37 38 39
col  2: [20]  00 30 00 31 00 32 00 33 00 34 00 35 00 36 00 37 00 38 00 39

As you can see the CLOB (column 0) is composed of a few header bytes and the same byte raw data as the UTF-16 NVARCHAR2 column.

As such I think you will have to convert your CLOB to UTF-8 to determine its length in this character set.

Here's an example I've used using DBMS_LOB.converttoblob:

SQL> DECLARE
  2     l_clob         CLOB := 'abcdéfghij'; -- the é will take two bytes!
  3     l_blob         BLOB;
  4     l_dest_offset  NUMBER := 1;
  5     l_src_offset   NUMBER := 1;
  6     l_lang_context NUMBER := 0;
  7     l_warning      NUMBER;
  8  BEGIN
  9     dbms_lob.createtemporary(l_blob, FALSE, dbms_lob.call);
 10     dbms_lob.converttoblob(dest_lob     => l_blob,
 11                            src_clob     => l_clob,
 12                            amount       => dbms_lob.lobmaxsize,
 13                            dest_offset  => l_dest_offset,
 14                            src_offset   => l_src_offset,
 15                            blob_csid    => nls_charset_id('AL32UTF8'),
 16                            lang_context => l_lang_context,
 17                            warning      => l_warning);
 18     dbms_output.put_line('byte length:'||dbms_lob.getlength(l_blob));
 19     dbms_lob.freetemporary(l_blob);
 20  END;
 21  /

byte length:11

PL/SQL procedure successfully completed

You can convert to any character set using the function nls_charset_id.

Vincent Malgrat
  • 66,725
  • 9
  • 119
  • 171
  • Thats what I'm actually doing before I want the length of my UTF-8 NCLOB. I've got a CLOB in the Windows-1252 charset and I've tried to call `CONVERT(my_clob, 'UTF8', 'WE8MSWIN1252')` to convert it to a NCLOB in UTF-8. But it turns out that `CONVERT(CLOB, VARCHAR2, VARCHAR2)` returns some pseudo UTF-8 CLOB (and not a NCLOB) with some odd padding that might be UTF-16... – Dave Aug 29 '13 at 13:41
  • @dave I don't think `convert` will do what you think it does. [`CONVERT`](http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions034.htm#i77037) will return a CLOB that has the same raw character set as the first one, though the data will be different. I've added an example with `dbms_lob.converttoblob` that should work with all clobs. – Vincent Malgrat Aug 29 '13 at 13:54
  • Vincent, not sure why, but I get byte length: 15 for this example, where if I dump this string I get the expected length of 11. Not sure why the discrepancy though. – tbone Aug 29 '13 at 15:22
  • @tbone Wow that's strange =) What tool/OS/session nls_character_set did you use? Also how did you dump the string? (I've tried dump on a clob but it didn't work) – Vincent Malgrat Aug 29 '13 at 15:56
  • NLS_NCHAR_CHARACTERSET = AL16UTF16, NLS_CHARACTERSET = US7ASCII. I used dump on the string directly (without using utf-16) by doing: select dump('abcdéfghij', 16) from dual. If I assign this string to a nvarchar2 variable inside pl/sql, then dump, I get a length of 22 (double of 11, which is expected). The value of 15 is unexpected I admit (which was from your code of converting to BLOB first then using dbms_lob.getlength). – tbone Aug 29 '13 at 17:02
  • If I run same code as above, but use AL16UTF16 in converttoblob, then I get length of 22. So UTF-8 gives length:15, and UTF-16 gives 22. You must have run your code using US7ASCII as the charset in the BLOB conversion I think. – tbone Aug 29 '13 at 17:05
  • Regarding the code above, I get a length of 11 with NLS_CHARACTERSET "WE8MSWIN12523" and NLS_NCHAR_CHARACTERSET "UTF8" too. – Dave Aug 30 '13 at 06:00
  • It seems that `TO_NCLOB()` does the trick of converting from Windows-1252 to UTF-8 with my charsets... – Dave Aug 30 '13 at 06:24
  • @dave I think it's because "é" is not in the `US7ASCII` charset, so the data must be converted between your screen and the VARCHAR2. Still I don't understand why it displays 15. – Vincent Malgrat Aug 30 '13 at 08:04
  • I think it was tbone who used `US7ASCII` charset. I'm accepting your answer though, since the code with the conversion to `BLOB` worked for me. – Dave Aug 30 '13 at 09:09