
I have a file that, when loaded into Snowflake, produces an error for invalid UTF-8 characters. I managed to load it into a table using another encoding, by creating a file format with the option ENCODING = 'iso-8859-1', but I would like to find a way to query those characters.

I've tried the TO_BINARY(col, 'UTF-8') function, hoping it would fail on the column that has invalid UTF-8, but I was not able to get a valid result that captures those characters. Has anyone faced the same issue?
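One way to locate the offending characters is to check the raw file before loading. The sketch below is a hypothetical pre-load check (the function name and sample bytes are illustrative, not from the original post): it scans a byte string and reports every position where strict UTF-8 decoding fails.

```python
# Hypothetical pre-load check: scan raw bytes for sequences that are
# not valid UTF-8, reporting each offending byte and its offset.

def find_invalid_utf8(data: bytes):
    """Return a list of (offset, byte_value) pairs that break UTF-8 decoding."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as e:
            # e.start/e.end are relative to the slice, so re-base them
            bad.append((pos + e.start, data[pos + e.start]))
            pos += e.end  # skip past the invalid sequence and keep scanning
    return bad

# 0xA0 is a valid ISO-8859-1 byte (non-breaking space) but, on its own,
# it is not valid UTF-8:
sample = b"price\xa0100"
print(find_invalid_utf8(sample))  # [(5, 160)]
```

Reading the file with `open(path, "rb")` and passing the bytes through a check like this would pinpoint exactly which characters Snowflake rejects.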

Monem_منعم
    Have a look at this answer [How to find rows with non-UTF8 characters](https://stackoverflow.com/questions/58987978/how-to-find-rows-with-non-utf8-characters-in-snowflake/58988864#58988864) – NickW Aug 19 '20 at 11:02
  • Thank you @NickW nice find, I did not get that result when searching! – Monem_منعم Aug 19 '20 at 12:05

2 Answers


Please note that ALL character data within Snowflake is encoded using UTF-8. There is no other option. A while back, this was not strictly true, and it was possible to have character data in Snowflake that was NOT valid UTF-8. But that should not be possible now.

Specifying the ENCODING = 'iso-8859-1' option instructed Snowflake (during the COPY INTO operation) to perform character set translation on the file (which was then interpreted as being encoded in ISO-8859-1), mapping all characters into their UTF-8 equivalent as it was written into Snowflake. As a result, all data in Snowflake is UTF-8 encoded, and therefore there should not be ANY non-UTF-8 characters to discover. That said, the result of character set translation might not end up translating to the correct/expected UTF-8 characters if the underlying (source) file was not truly encoded with the encoding that you specified during the COPY INTO (in this case, ISO-8859-1).
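This translation, and what goes wrong when the declared encoding is incorrect, can be sketched in a few lines of Python (a rough analogy for illustration, not Snowflake's actual implementation):

```python
# Sketch of the translation performed with ENCODING = 'iso-8859-1':
# each source byte is interpreted as an ISO-8859-1 code point and
# re-encoded as UTF-8 on the way in.

raw = b"caf\xe9"                 # 'café' in ISO-8859-1 (0xE9 = é)
text = raw.decode("iso-8859-1")  # interpret bytes as ISO-8859-1
stored = text.encode("utf-8")    # what ends up stored (always UTF-8)
print(stored)                    # b'caf\xc3\xa9' -- é is now two bytes

# If the source file was really UTF-8 but declared as ISO-8859-1, each
# byte of a multi-byte sequence is translated separately -> mojibake:
mislabeled = b"caf\xc3\xa9".decode("iso-8859-1")
print(mislabeled)                # 'cafÃ©'
```

Both results are perfectly valid UTF-8 once stored; the second is simply the wrong characters, which is why the problem shows up as garbled text rather than a decoding error at query time.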

Given this, what is the ultimate problem that you are trying to resolve here? Did you load a source file with ENCODING = 'iso-8859-1' that was not actually ISO-8859-1? Or are you saying that the source file WAS truly encoded as ISO-8859-1, and yet somehow the resulting characters in Snowflake are either (1) incorrect or (2) invalid UTF-8? Or are you trying to determine the actual encoding of a source file (ignoring the whole ISO-8859-1 aspect altogether)?

Darren Gardner
  • Thank you @darren for the full explanation you've added. – Monem_منعم Aug 19 '20 at 23:53
  • The file was indeed in ISO-8859-1, and I loaded it thinking it would preserve the encoding, but as you mentioned here it gets translated into UTF-8, which makes more sense now. But another file, which I found to be encoded as UTF-8, gave me invalid-UTF-8 errors; in Notepad++, using "show all characters", I found some bytes like xA0, so I had to load it using that same ISO-8859-1 encoding, and all rows in the file loaded. Does that mean they got translated to UTF-8 as well? – Monem_منعم Aug 20 '20 at 00:07
  • Which operating system are you using? If Linux, you can use ad-hoc techniques like the ones described here (https://stackoverflow.com/questions/48729215/how-to-check-character-encoding-of-a-file-in-linux) to "determine" (not 100%, but often very close) the encoding of a file if you do not know what encoding was used to produce it. Other operating systems have their own techniques for this (if you do not have Linux). I would suggest verifying the second file to see if it indeed is UTF-8 or some other encoding. – Darren Gardner Aug 20 '20 at 00:53
  • If it truly is UTF-8, and characters are not preserved during a load into Snowflake, I would recommend that you open a case with Snowflake Support, since that should NEVER happen. – Darren Gardner Aug 20 '20 at 00:53
  • I'm using Windows, and the file was produced on Windows as well, thanks for the tips. It was a quick, one-time load for our POC, so I did not spend time validating the file thoroughly, as I was focused on loading all rows. Next time I will probably verify the encoding and, as you suggested, open a case if I am sure it is UTF-8. – Monem_منعم Aug 20 '20 at 01:59
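For the encoding check discussed in the comments above, a portable alternative to the Linux `file`-style tools is a short Python heuristic (a rough sketch, not a definitive detector): bytes that decode as strict UTF-8 are almost certainly UTF-8, since accidental valid UTF-8 is rare, while ISO-8859-1 accepts every byte and so can only serve as a fallback label.

```python
# Rough, OS-independent encoding guess: try strict UTF-8 first;
# fall back to ISO-8859-1, which maps every possible byte.

def guess_encoding(data: bytes) -> str:
    """Very rough guess: strict UTF-8, else a single-byte fallback."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 always "succeeds", so this is a fallback label,
        # not a proven encoding.
        return "iso-8859-1 (or another single-byte encoding)"

print(guess_encoding("déjà vu".encode("utf-8")))       # utf-8
print(guess_encoding("déjà vu".encode("iso-8859-1")))  # the fallback label
```

Libraries such as `chardet` give more nuanced guesses, but for deciding between UTF-8 and a Latin-1-style export before a COPY INTO, a check like this is often enough.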

A detailed answer can be found here: [How to find rows with non-UTF8 characters in Snowflake](https://stackoverflow.com/questions/58987978/how-to-find-rows-with-non-utf8-characters-in-snowflake/58988864#58988864)

Please mark my question as a duplicate and refer to that link.

Monem_منعم