
After an insert into a temp table using the OUTPUT clause, I am getting extra characters in the destination that are not in the source.

Why?

--

E.g. after inserting an existing PhysicalTable_1 record with LName = 'John', the destination PhysicalTable_2 table as well as the #Temp table contain '?John' or 'I?John'. It happens sometimes for LName, sometimes for FName or Email, as well as other fields.

An example of data in PhysicalTable_1: FName = '​Raul', and the destination record after insertion looks like '?Raul'.
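Inspecting the raw bytes of the source value shows there is more in it than the visible characters (a quick check; the WHERE clause is just one way to locate the affected row):

    SELECT FName,
           CAST(FName AS VARBINARY(100)) AS FNameBytes, --raw UTF-16 bytes
           LEN(FName) AS CharCount --more chars than visible letters = a hidden character
    FROM PhysicalTable_1
    WHERE FName LIKE '%Raul%';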

I'm using this:

    CREATE TABLE #Temp
    (
        ID INT NOT NULL,
        LName VARCHAR(75) NULL,
        FName VARCHAR(75) NULL,
        Email VARCHAR(125) NULL
    )

    CREATE TABLE PhysicalTable_2
    (
        ID INT NOT NULL,
        LName VARCHAR(75) NULL,
        FName VARCHAR(75) NULL,
        Email VARCHAR(125) NULL
    )

    CREATE TABLE PhysicalTable_1
    (
        ID INT NOT NULL,
        LName NVARCHAR(500) NULL,
        FName NVARCHAR(500) NULL,
        Email NVARCHAR(500) NULL
    )

    INSERT INTO PhysicalTable_2
    (
        ID, LName, FName, Email
    )
    OUTPUT INSERTED.ID, INSERTED.LName, INSERTED.FName, INSERTED.Email
    INTO #Temp
    SELECT ID, LName, FName, Email
    FROM PhysicalTable_1

I also tried changing all the string columns of the #Temp table to NVARCHAR; still, some records in the destination ended up with extra characters.
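A minimal repro sketch of what I am seeing (table and value names are invented for illustration). An NVARCHAR #Temp does not help because OUTPUT INSERTED returns each value as it was stored in the destination column, i.e. after the implicit conversion:

    CREATE TABLE #Demo (Val VARCHAR(10));
    CREATE TABLE #Captured (Val NVARCHAR(10));

    INSERT INTO #Demo (Val)
    OUTPUT INSERTED.Val INTO #Captured
    SELECT N'αRaul'; --'α' has no single-byte equivalent

    SELECT Val FROM #Captured; --'?Raul': OUTPUT sees the post-conversion value

    DROP TABLE #Demo;
    DROP TABLE #Captured;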

  • Please share sample data from the PhysicalTable_1 table – Mukesh Arora Dec 06 '19 at 05:17
  • Hi Mukesh, an example of data in PhysicalTable_1: FName = '​Raul', and the destination record after insertion looks like '?Raul'. I copied the values directly from PhysicalTable_1 and then from the destination table (PhysicalTable_2) – Data Engineer Dec 06 '19 at 05:31
  • Are you facing this issue only with the OUTPUT clause, or even when you insert the records directly? Try to export the source table data and see the result. – Mukesh Arora Dec 06 '19 at 05:42
  • A similar problem is described here: https://stackoverflow.com/questions/27061810/unicode-to-non-unicode-conversion – Data Engineer Dec 07 '19 at 04:00

1 Answer


The problem is that your PhysicalTable_1 contains non-printable Unicode characters in LName (and in the other string columns). You insert the Unicode NVARCHAR LName column of PhysicalTable_1 into the non-Unicode VARCHAR LName column of PhysicalTable_2. Non-Unicode characters are half the size of Unicode characters in SQL Server, so during the implicit conversion some bytes have to be "cut off", and because of the size reduction the formerly invisible characters become apparent as '?' or other stray characters.

--characters to binary
SELECT CAST(N'P' AS VARBINARY(10)) AS UnicodeP, CAST('P' AS VARBINARY(10)) AS AsciiP --unicode is double the size of ascii


CREATE TABLE #temp(UnicodeP NVARCHAR(10), AsciiP VARCHAR(10));

INSERT INTO #temp(UnicodeP, AsciiP) VALUES (N'P', 'P'); --nothing special, normal insertion
INSERT INTO #temp(UnicodeP, AsciiP) VALUES ('P', N'P'); --omitting the N for unicode and using N for ascii, still works ok, implicit conversion

SELECT * FROM #temp;

--use binary from the very first SELECT CAST(....
INSERT INTO #temp(UnicodeP, AsciiP) VALUES (0x5000, 0x50); --still fine
SELECT * FROM #temp;

--prepend a non-printable character (the BOM, 0xFEFF) to unicode P; insert into UnicodeP only
INSERT INTO #temp(UnicodeP, AsciiP) VALUES (0xFEFF + 0x5000, NULL); --still fine
SELECT * FROM #temp;

--if you copy and paste the last UnicodeP, where AsciiP is NULL, you will not notice any visual difference


--update the ascii column from the unicode column, where ascii is null
UPDATE #temp
SET AsciiP = UnicodeP --implicit conversion, ascii is half the unicode, some bytes have to go away
WHERE AsciiP IS NULL;

--since unicode was implicitly converted to ascii, some bytes are "stripped out": the non-printable 0xFEFF has to be "cut in half" and becomes an unidentified char
SELECT UnicodeP, CAST(UnicodeP AS VARBINARY(10)) AS UnicodePbinary, AsciiP, CAST(AsciiP AS VARBINARY(10)) as AsciiPbinary
FROM #temp;


DROP TABLE #temp;

Edit: implicit Unicode-to-non-Unicode conversion, and an AsciiOrNothing example:

SELECT NCHAR(rownum) AS TheChar, CAST(NCHAR(rownum) AS CHAR(1)) AS ImplicitConversion, 
    CASE WHEN NCHAR(rownum) < N'Ā' collate Latin1_General_BIN2 THEN NCHAR(rownum) ELSE '' END AS AsciiOrNothing,
    UNICODE(NCHAR(rownum)) AS CharInteger,
    --or
    CASE WHEN UNICODE(/*TheChar*/ NCHAR(rownum)) <= 255 THEN NCHAR(rownum) ELSE '' END AS AsciiOrNothing2
FROM 
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT null)) AS rownum
FROM (
    --100 x 100 = 10K rows
    SELECT TOP (100) name FROM master.dbo.spt_values) AS a
    CROSS JOIN (SELECT TOP (100) name FROM master.dbo.spt_values) AS b
) AS src
ORDER BY rownum
  • Thanks for the wonderful use case lptr. But having all this explained, what should be done to avoid such a conversion adding extra characters when inserting into a non-Unicode field from a Unicode one? – Data Engineer Dec 07 '19 at 03:55
  • ...in such a case, the 'safest' approach would be to preserve all ASCII characters (from the Unicode string) and discard all the others. Note: some Unicode chars can be "converted" to ASCII, but that does not mean the "conversion" is correct; for example, some of the Greek chars in SELECT 'αβγδεζμνξπρστυφχ' are "translated" to Latin, μ remains the same, and others are just "unknown". The same goes for SELECT 'Ǎ, Ǡ, Č, Ƈ, ň, ʼn': the first accented A is "translated" to A while the second is "unknown". – lptr Dec 09 '19 at 11:28
  • You could do that (i.e. replace Unicode chars with an empty string) by using regular expressions or by looping through each char with T-SQL. An easy way to find out whether a char is ASCII is to compare it with N'Ā' in a binary collation (Ā is the first char outside extended ASCII). The code in the answer has been edited to include an example. You could use a combination if it suits you (keep the AsciiOrNothing when not empty, or, when empty, the implicitly converted char when it is not '?'). – lptr Dec 09 '19 at 11:45
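A minimal sketch of the loop-based approach described in the comments (the function name dbo.AsciiOnly and the sizes are illustrative; edge cases such as trailing spaces and code-page quirks are ignored for brevity):

CREATE FUNCTION dbo.AsciiOnly (@s NVARCHAR(4000))
RETURNS VARCHAR(4000)
AS
BEGIN
    DECLARE @out VARCHAR(4000) = '';
    DECLARE @i INT = 1;

    WHILE @i <= LEN(@s)
    BEGIN
        --N'Ā' (U+0100) is the first char outside extended ascii;
        --keep only chars below it, discard everything else
        IF SUBSTRING(@s, @i, 1) < N'Ā' COLLATE Latin1_General_BIN2
            SET @out = @out + CAST(SUBSTRING(@s, @i, 1) AS VARCHAR(1));
        SET @i = @i + 1;
    END;

    RETURN @out;
END;

--usage: SELECT dbo.AsciiOnly(N'​Raul') returns 'Raul', with the hidden character removed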