
Short version of the question: Is there a way to count how many bytes it will take to store the characters of a string in a VARCHAR(n) column in an Oracle database?

Longer version: The following Oracle SQL script fails at the third statement. It tries to insert 10 characters into a VARCHAR(10) column; however, one of those characters is 'á' (a with an acute accent).

CREATE TABLE TESTTABLE (NAME VARCHAR(10) NULL);

INSERT INTO TESTTABLE (NAME) VALUES ('abcdefghij');
--1 rows inserted.

INSERT INTO TESTTABLE (NAME) VALUES ('ábcdefghij');
--ORA-12899: value too large for column "ACME"."TESTTABLE"."NAME" (actual: 11, maximum: 10)

My C# application stores strings in an Oracle database and I can't just change the column type to NVARCHAR2(10), which would work. The application is expected to trim any longer string to a limit of 10 characters, so Oracle won't complain about its length. But a String.Length-based trim is a very naive strategy: it would blindly leave "ábcdefghij" intact with all its 10 CLR characters.

How can I detect that 'á' will take 2 bytes in the database row so I can trim the string to "ábcdefghi" before issuing the INSERT statement?

EDIT: This question is similar to Best way to shorten UTF8 string based on byte length.

– Humberto

4 Answers


The Oracle function LENGTH(string) returns the number of characters; LENGTHB(string) returns the number of bytes.
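
If you need that number on the C# side before issuing the INSERT, one option is to let Oracle compute it for you. A minimal sketch, assuming the ODP.NET managed driver (Oracle.ManagedDataAccess) and an already-open connection; ByteLengthInDb is a made-up helper name:

using System;
using Oracle.ManagedDataAccess.Client;

// Ask the database for the byte length of a string under its own character set.
static int ByteLengthInDb(OracleConnection connection, string value)
{
    using (var command = new OracleCommand("SELECT LENGTHB(:str) FROM dual", connection))
    {
        command.Parameters.Add(new OracleParameter("str", value));
        return Convert.ToInt32(command.ExecuteScalar());
    }
}

This costs a round trip per string, so it is mostly useful when you cannot assume which character set the database uses.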

– Egor Skriptunoff

This depends on the character set your Oracle database is configured to use. You can convert the string into a byte array using the corresponding System.Text.Encoding instance, such as System.Text.Encoding.UTF8 (you're looking for the GetBytes method).
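
For example, a minimal trim sketch, assuming the database character set is AL32UTF8 so that Encoding.UTF8 matches it; TrimToByteLength is a made-up helper name:

using System.Text;

// Trim value so that its encoded form fits into maxBytes.
static string TrimToByteLength(string value, int maxBytes, Encoding encoding)
{
    char[] chars = value.ToCharArray();
    int length = chars.Length;

    // Drop characters from the end until the encoded length fits.
    // (For strings containing surrogate pairs you would also want to avoid
    // cutting between a high and a low surrogate.)
    while (length > 0 && encoding.GetByteCount(chars, 0, length) > maxBytes)
        length--;

    return value.Substring(0, length);
}

With the example from the question, TrimToByteLength("ábcdefghij", 10, Encoding.UTF8) returns "ábcdefghi", which is 10 bytes in UTF-8.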

– Billy ONeal
  • This is the answer that best fits the situation, because I can only do so much on the database side of the problem. I'm limited to changing the C# app only. Thanks everybody! – Humberto Mar 11 '13 at 21:22

By default, VARCHAR2(10) allocates 10 bytes of storage which may or may not equate to 10 characters depending on the database character set and the data. You can, however, declare the column so that it will always store 10 characters.

When you are using a variable-width character set, you almost always want to declare the column using character-length semantics: VARCHAR2(10 CHAR). Then you can always store 10 characters, regardless of whether some of them require more than 1 byte of storage.

If you're stuck declaring columns using byte-length semantics for some reason, you could use the LENGTHB or VSIZE function to return the length of the string in bytes.

select lengthb('ábcdefghij'), vsize('ábcdefghij')
  from dual;

will return 11 for both columns of the result.

– Justin Cave
  • `+1` for `VARCHAR2(10 CHAR)` – Mike Christensen Mar 11 '13 at 20:49
  • `VARCHAR2` would be a very nice idea if the DBA had thought about international characters before! And he should have done it, by the way. We have too many accented characters in Portuguese. – Humberto Mar 11 '13 at 21:20
  • 1
    @Humberto - `VARCHAR` and `VARCHAR2` are synonyms-- you can happily declare a `VARCHAR(10 CHAR)` column if you'd like (or alter an existing `VARCHAR(10 BYTE)` column to `VARCHAR(10 CHAR)`). – Justin Cave Mar 11 '13 at 21:31

You can get the length in bytes of the string like this:

using System.Text;

UTF8Encoding encoding = new UTF8Encoding();
byte[] utf8Bytes = encoding.GetBytes("ábcdefghij");
int stringLength = utf8Bytes.Length;

In fact, for your example, it returns 11.
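
If only the number is needed, Encoding.UTF8.GetByteCount gives the same result without allocating the intermediate byte array (and reuses the static UTF8 instance, as suggested in the comment below):

int stringLength = System.Text.Encoding.UTF8.GetByteCount("ábcdefghij"); // 11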

– Rafael
  • +1 for code example of my answer. Though you should probably use the static Encoding.UTF8 instance rather than creating new UTF8Encoding instances. (That will result in less GC overhead) – Billy ONeal Mar 11 '13 at 20:07