In my snowflake db, a table has non-utf8 characters. How can I create a view on top of it which will only have utf8 characters; either by excluding the rows with non-utf8 characters or by replacing them? Thanks
Asked
Active
Viewed 4,661 times
4
-
Can you post some sample data/code. – Roshan Nuvvula Nov 22 '19 at 05:47
2 Answers
5
It should be possible to check for non-utf with a test like this:
MY_STRING IS NOT NULL AND TRY_HEX_DECODE_STRING(HEX_ENCODE(MY_STRING)) IS NULL
But then I don't have data to test with.
To reencode the string to utf-8, you can use a JavaScript function:
CREATE OR REPLACE FUNCTION TO_UTF8(BINARY_TEXT BINARY)
RETURNS TEXT LANGUAGE JAVASCRIPT STRICT IMMUTABLE AS '
var win1252 = [ /* C1 controls */
8364, 129, 8218, 402, 8222, 8230, 8224, 8225,
710, 8240, 352, 8249, 338, 141, 381, 143,
144, 8216, 8217, 8220, 8221, 8226, 8211, 8212,
732, 8482, 353, 8250, 339, 157, 382, 376
];
return String.fromCharCode(
...Array.from(BINARY_TEXT).map(x => (x < 128 || x > 159) ? x : (win1252[x - 128]))
); /* .map(...) can be removed if no conversion from win1252 needed */
';
SELECT NVL(TRY_HEX_DECODE_STRING(HEX_ENCODE(MY_STRING)),
TO_UTF8(HEX_ENCODE(MY_STRING)::BINARY));

Hans Henrik Eriksen
- 2,690
- 5
- 12
-
This works! It detects rows with non-utf8 characters. Now, is there a way to replace those characters (so not to discard the row)? – elfersi Nov 22 '19 at 13:14
-
I've expanded my answer with an expression that replaces illegal bytes with space. Should be used when the test above fails. – Hans Henrik Eriksen Nov 22 '19 at 15:52
-
Thanks but `select * from table where TRY_HEX_DECODE_STRING(HEX_ENCODE(field)) is null` returns 3 rows while `select * from table where field <> TRY_HEX_DECODE_STRING( REPLACE( REGEXP_REPLACE( REGEXP_REPLACE(HEX_ENCODE(field), '(..)', '\\1,'), '([89A-F].,)', '20,'), ',', '' ) )` returns 484 rows. What are your thoughts? – elfersi Nov 22 '19 at 17:17
-
The expression above was only meant to be used for invalid utf-8. Instead, I made a JavaScript function to reencode a string, and even convert from win1252 to utf-8 if that is the real format of the string. Please see the answer above. – Hans Henrik Eriksen Nov 23 '19 at 00:02
2
Though I tried Hans's solution, it isn't working in all scenarios. Hence, I wrote a very simple Snowflake Javascript stored procedure that will replace all characters that are not part of keyboard.
CREATE OR REPLACE FUNCTION MYSCHEMA.REPLACE_NON_UTF8_CHARS(STR varchar)
RETURNS TEXT LANGUAGE JAVASCRIPT STRICT IMMUTABLE AS '
return STR.replace(/[^ -~]+/g, '''')
';
This can be implemented in the sql like below.
SELECT MYSCHEMA.REPLACE_NON_UTF8_CHARS(MY_STRING) AS MY_STRING
FROM
(
SELECT 'ABC,D�E123 5' AS MY_STRING
)
And the output will be like below. You can see the junk character after D
is replaced by empty character.
ABC,DE123 5

Sarath Subramanian
- 20,027
- 11
- 82
- 86