
We use Oracle 10g and Oracle 11g.

We also have a layer that automatically composes queries from pseudo-SQL code written in .NET (something like SQLAlchemy for Python).

Our layer currently wraps any string in single quotes ' and, if it contains non-ANSI characters, it automatically composes a UNISTR call with the special characters written as Unicode escapes (like \00E0).

We have now created a method for doing multiple inserts with the following construct:
INSERT INTO ... (...) SELECT ... FROM DUAL UNION ALL SELECT ... FROM DUAL ...

This algorithm could compose queries where the same string field is sometimes passed as 'my simple string' and sometimes wrapped as UNISTR('my string with special chars like \00E0').

The described condition causes an ORA-12704: character set mismatch.
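
For example, a composed statement like the following (the table and column names are made up for illustration) fails, because the plain literal and the UNISTR result resolve to different character sets:

    INSERT INTO my_table (name_col)
    SELECT 'my simple string' FROM DUAL
    UNION ALL
    SELECT UNISTR('my string with special chars like \00E0') FROM DUAL;
    -- ORA-12704: character set mismatch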

One solution is to use the INSERT ALL construct, but it is very slow compared to the one we use now.
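
For reference, the INSERT ALL form would look roughly like this (again with made-up table and column names); each row gets its own VALUES clause, so no UNION merging takes place:

    INSERT ALL
      INTO my_table (name_col) VALUES ('my simple string')
      INTO my_table (name_col) VALUES (UNISTR('my string with special chars like \00E0'))
    SELECT * FROM DUAL;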

Another solution is to instruct our layer to put N in front of any string (except for the ones already wrapped with UNISTR). This is simple.
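
With that change, the composed statement from the example above would become something like this (hypothetical names again), and both branches of the UNION ALL would be in the national character set:

    INSERT INTO my_table (name_col)
    SELECT N'my simple string' FROM DUAL
    UNION ALL
    SELECT UNISTR('my string with special chars like \00E0') FROM DUAL;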

I just want to know if this could cause any side-effect on existing queries.

Note: all our fields in the DB are either NCHAR or NVARCHAR2.


Oracle ref: http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch7progrunicode.htm

Teejay
  • You could also cast if you know the target column size. Or your layer might support a proper bulk insert mechanism. But surely using `n'...'` just avoids an implicit conversion of the literal during the insert, from your database character set to the national character set? – Alex Poole Mar 03 '16 at 14:44
  • @AlexPoole Sincerely, I don't understand your question... – Teejay Mar 03 '16 at 15:16
  • How many rows are being inserted per statement? If `INSERT ALL` is slower than `UNION ALL` you may be running into an Oracle parsing problem, as explained in my answer [here](http://stackoverflow.com/a/11663076/409172). It may be enough to break the `INSERT ALL` into smaller chunks to avoid the long parse times of huge SQL statements. – Jon Heller Mar 08 '16 at 15:56
  • @JonHeller At the moment, I set the row-per-query constant to `100`. With this value, I can get 81,000 rows (x 23 columns) inserted in ~21 secs. – Teejay Mar 08 '16 at 16:50

3 Answers


Basically, what you are asking is: is there a difference in how a string is stored with or without the N function?

You can check this for yourself; consider:

SQL> create table test (val nvarchar2(20));

Table TEST created.

SQL> insert into test select n'test' from dual;

1 row inserted.

SQL> insert into test select 'test' from dual;

1 row inserted.

SQL> select dump(val) from test;
DUMP(VAL)                                                                      
--------------------------------------------------------------------------------
Typ=1 Len=8: 0,116,0,101,0,115,0,116                                            
Typ=1 Len=8: 0,116,0,101,0,115,0,116  

As you can see, they are identical, so there is no side effect.

The reason this works so beautifully is the elegance of Unicode.

If you are interested, here is a nice video explaining it:

https://www.youtube.com/watch?v=MijmeoH9LT4

Neil Barsema
  • Could there be a decrement in performance applying N everywhere on string literals? – Teejay Mar 08 '16 at 14:38
  • "Could there be a decrement in performance applying N everywhere on string literals?" No, it couldn't, because any CHAR value inserted into an NCHAR column is converted to NCHAR either implicitly or explicitly. – Mikhailov Valentin Mar 08 '16 at 15:22
  • @MikhailovValentine Thanks. So, with `N`, I am just *explicit-ing* a process that happens implicitly anyway? – Teejay Mar 08 '16 at 16:52

I assume that you get the error "ORA-12704: character set mismatch" because the data inside quotes is considered CHAR while your fields are NCHAR, so the values end up in different character sets: one uses NLS_CHARACTERSET, the other NLS_NCHAR_CHARACTERSET.

When you use the UNISTR function, it converts data from CHAR to NCHAR (and it also decodes the escaped values into characters), as the Oracle docs say:

"UNISTR takes as its argument a text literal or an expression that resolves to character data and returns it in the national character set."

When you convert values explicitly using N or TO_NCHAR, you only get values in NLS_NCHAR_CHARACTERSET without any decoding. If you have values encoded like "\00E0", they will not be decoded and will be passed through unchanged.

So if you have an insert such as:

   insert into your_table (col1, col2)
   select N'my string with special chars like \00E0',
          UNISTR('my string with special chars like \00E0')
   from dual ....

your data in the first inserted column will be 'my string with special chars like \00E0', not 'my string with special chars like à'. This is the only side effect I'm aware of. Other queries should already use the NLS_NCHAR_CHARACTERSET encoding, so there shouldn't be any problem with an explicit conversion.

And by the way, why not just insert all values as N'my string with special chars like à'? Just encode them into UTF-16 first (I assume you use UTF-16 for NCHAR) if your 'upper level' software uses a different encoding.
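
A quick way to check that equivalence for yourself (assuming your client and database character sets can represent à, and that the national character set is AL16UTF16) is to compare the dumps; both selects below should return the same bytes, i.e. Typ=1 Len=2: 0,224:

    select dump(unistr('\00E0')) from dual;
    select dump(n'à')            from dual;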

Mikhailov Valentin
  • *"I assume that you get an error "ORA-12704: character set mismatch" because your data inside quotes considered as char but your fields is nchar"* No, I get the error because I'm mixing non-unicode and unicode text with `UNION ALL`. – Teejay Mar 08 '16 at 14:55
  • *"But if you have some values encoded like this "\00E0" they will not be decoded and will be considered as is."* Strings containing special characters are automatically wrapped with `UNISTR` by our layer, others are not. This is why the mixing appens, and this is why I need N for other strings. – Teejay Mar 08 '16 at 14:58
  • *"And by the way, why not just insert all values as `N'my string with special chars like à'`"* So you're saying that there is no difference between using `UNISTR('\00E0')` and `N'à'`? – Teejay Mar 08 '16 at 15:01
  • "is no difference between using UNISTR('\00E0') and N'à'" At least when I insert using JDBC or SQL Developer, I can't find a difference. – Mikhailov Valentin Mar 08 '16 at 15:08
  • "No, I get the error because I'm mixing non-unicode and unicode text with UNION ALL." Try to create a test table with char and nchar columns and try to make insert using N'string' for n char and 'string' for char - it will work. – Mikhailov Valentin Mar 08 '16 at 15:14
  • Yes indeed, because they are two separate columns. In my case, when using `UNION ALL`, the values from each sub-`SELECT` statement merge into a single column before being inserted, so they must have the same character set. – Teejay Mar 08 '16 at 16:56
Regarding the use of the N function: you already have answers above.

If you have any chance to change the character set of the database, that would really make your life easier. I have worked on huge production systems, and the trend I have seen is that, because storage space is cheap, everyone simply moves to AL32UTF8 and the hassle of internationalization slowly becomes a painful memory of the past.

I have found that the easiest thing is to use AL32UTF8 as the character set of the database instance and simply use VARCHAR2 everywhere. We read and write standard Java Unicode strings via JDBC as bind variables without any harm or fiddling.

Your idea to construct a huge text of SQL inserts may not scale well for multiple reasons:

  • there is a fixed length of maximum allowed SQL statement - so it won't work with 10000 inserts
  • it is advised to use bind variables (and then you don't have the N'xxx' vs UNISTR mess either - see the sketch after this list)
  • creating a new SQL statement dynamically each time is very resource-unfriendly. It does not allow Oracle to cache any execution plan, and it makes Oracle hard-parse your very long statement on each call.
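
As a minimal sketch of what the bind-variable form looks like (my_table and its columns are hypothetical; the :1 and :2 placeholders would be bound and re-executed by the data access layer):

    -- the statement text stays constant, so Oracle parses it once and reuses the plan
    INSERT INTO my_table (name_col, note_col) VALUES (:1, :2)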

What you're trying to achieve is a mass insert. Use the JDBC batch mode of the Oracle driver to perform that at light-speed, see e.g.: http://viralpatel.net/blogs/batch-insert-in-java-jdbc/

Note that insert speed is also affected by triggers (which have to be executed) and foreign key constraints (which have to be validated). So if you're about to insert more than a few thousand rows, consider disabling the triggers and foreign key constraints and re-enabling them after the insert. (You'll lose the trigger calls, but the constraint validation after the insert can still have an impact.)

Also consider the rollback segment size. If you're inserting a million records, that will need a huge rollback segment, which will likely cause serious swapping on the storage media. A good rule of thumb is to commit after every 1000 records.

(Oracle uses versioning instead of shared locks, therefore a table with uncommitted changes is still consistently available for reading. A 1000-record commit rate means roughly 1 commit per second - slow enough to benefit from write buffers, but frequent enough not to interfere with other users who want to update the same table.)
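
A rough PL/SQL sketch of that rule of thumb (staging_table and my_table are hypothetical; a real loader would typically use bulk binds as well):

    DECLARE
      l_count PLS_INTEGER := 0;
    BEGIN
      FOR r IN (SELECT name_col FROM staging_table) LOOP
        INSERT INTO my_table (name_col) VALUES (r.name_col);
        l_count := l_count + 1;
        IF MOD(l_count, 1000) = 0 THEN
          COMMIT;  -- commit roughly every 1000 rows to keep rollback/undo usage small
        END IF;
      END LOOP;
      COMMIT;  -- final commit for the remaining rows
    END;
    /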

Gee Bee
  • *"there is a fixed length of maximum allowed SQL statement - so it won't work with 10000 inserts"*, simply not true. Oracle **has not fixed length limits**, see http://stackoverflow.com/questions/14355819/what-is-the-maximum-statement-length-in-oracle. BTW, our layer automatically splits query in pre-defined size, so we do not need to worry about such things. – Teejay Mar 14 '16 at 08:25
  • *"What you're trying to achieve is a mass insert. Use the JDBC batch mode of the Oracle driver"* I know there are ways to bulk insert, i.e. starting from a formatted text file, but this is simply not the case. Our layer also compose queries for SqlServer and Postgres. BTW, no one mentioned Java, we work with .net. – Teejay Mar 14 '16 at 08:31
  • *"consider disabling the triggers and foreign key constraints"* We have no triggers in our configuration. Anyway, please note that triggers are often something you can't do without, especially if they do data modifications. – Teejay Mar 14 '16 at 08:38
  • *"consider disabling the triggers and foreign key constraints"* Foreign key constraints may have impact on data insertion rate, but if you disable them, you need to re-enable them later... and **the time to re-enable (and check) is comparable to time saved during insertion**. – Teejay Mar 14 '16 at 08:42
  • *"Also consider the rollback segment size."* No one mentioned transactions. Anyway they are there to be used. Committing @ 1000 records threshold, in some applications, is ridicolous. – Teejay Mar 14 '16 at 08:44
  • @Teejay, I am sorry if you misunderstood my answer. I suggested a much easier approach to globalization than NVARCHARs (one which actually works in any Unicode environment, including .NET) and tried to address your problem with slowness. If these are not relevant for you, feel free to ignore them. For the record, transactions *are* there all the time, whether implicit or explicit. I have found that committing every 1000 records is pretty standard in bulk loading of the production enterprise Oracle databases I have been working with for the last 15 years. – Gee Bee Mar 14 '16 at 12:27
  • Sorry for misinterpreting your answer and underestimating your experience. Please slightly edit your answer, so I can remove the downvote (it is locked). Unfortunately, changing the approach is not an option here. BTW, `INSERT INTO...UNION ALL` is quite fast, at least for our needs. – Teejay Mar 14 '16 at 13:38