
I have a file that, when loaded into Snowflake, produces an error for invalid UTF-8 characters. I managed to load it into a table using another encoding, by creating a file format with the option ENCODING = 'iso-8859-1', but I would like to find a way to query those characters.

I've tried the TO_BINARY(col, 'UTF-8') function, hoping it would fail on the column that has invalid UTF-8, but I was not able to get a valid result that captures those characters. Has anyone faced the same issue?
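One way to locate the offending characters is to check the raw file before loading. The sketch below is a hypothetical pre-load check (the function name and sample bytes are illustrative, not from the original post): it scans a byte string and reports every position where strict UTF-8 decoding fails.

```python
# Hypothetical pre-load check: scan raw bytes for sequences that are
# not valid UTF-8, reporting each offending byte and its offset.

def find_invalid_utf8(data: bytes):
    """Return a list of (offset, byte_value) pairs that break UTF-8 decoding."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as e:
            # e.start/e.end are relative to the slice, so re-base them
            bad.append((pos + e.start, data[pos + e.start]))
            pos += e.end  # skip past the invalid sequence and keep scanning
    return bad

# 0xA0 is a valid ISO-8859-1 byte (non-breaking space) but, on its own,
# it is not valid UTF-8:
sample = b"price\xa0100"
print(find_invalid_utf8(sample))  # [(5, 160)]
```

Reading the file with `open(path, "rb")` and passing the bytes through a check like this would pinpoint exactly which characters Snowflake rejects.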

Monem_منعم
    Have a look at this answer [How to find rows with non-UTF8 characters](https://stackoverflow.com/questions/58987978/how-to-find-rows-with-non-utf8-characters-in-snowflake/58988864#58988864) – NickW Aug 19 '20 at 11:02
  • Thank you @NickW nice find, I did not get that result when searching! – Monem_منعم Aug 19 '20 at 12:05

2 Answers


Please note that ALL character data within Snowflake is encoded using UTF-8. There is no other option. A while back, this was not strictly true, and it was possible to have character data in Snowflake that was NOT valid UTF-8. But that should not be possible now.

Specifying the ENCODING = 'iso-8859-1' option instructed Snowflake (during the COPY INTO operation) to perform character set translation on the file (which was then interpreted as being encoded in ISO-8859-1), mapping all characters into their UTF-8 equivalent as it was written into Snowflake. As a result, all data in Snowflake is UTF-8 encoded, and therefore there should not be ANY non-UTF-8 characters to discover. That said, the result of character set translation might not end up translating to the correct/expected UTF-8 characters if the underlying (source) file was not truly encoded with the encoding that you specified during the COPY INTO (in this case, ISO-8859-1).
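This translation, and what goes wrong when the declared encoding is incorrect, can be sketched in a few lines of Python (a rough analogy for illustration, not Snowflake's actual implementation):

```python
# Sketch of the translation performed with ENCODING = 'iso-8859-1':
# each source byte is interpreted as an ISO-8859-1 code point and
# re-encoded as UTF-8 on the way in.

raw = b"caf\xe9"                 # 'café' in ISO-8859-1 (0xE9 = é)
text = raw.decode("iso-8859-1")  # interpret bytes as ISO-8859-1
stored = text.encode("utf-8")    # what ends up stored (always UTF-8)
print(stored)                    # b'caf\xc3\xa9' -- é is now two bytes

# If the source file was really UTF-8 but declared as ISO-8859-1, each
# byte of a multi-byte sequence is translated separately -> mojibake:
mislabeled = b"caf\xc3\xa9".decode("iso-8859-1")
print(mislabeled)                # 'cafÃ©'
```

Both results are perfectly valid UTF-8 once stored; the second is simply the wrong characters, which is why the problem shows up as garbled text rather than a decoding error at query time.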

Given this, what is the ultimate problem that you are trying to resolve here? Did you load a source file with ENCODING = 'iso-8859-1' that was not actually ISO-8859-1? Or are you saying that the source file WAS truly encoded as ISO-8859-1, and yet somehow the resulting characters in Snowflake are either (1) incorrect or (2) invalid UTF-8? Or are you trying to determine the actual encoding of a source file (ignoring the whole ISO-8859-1 aspect altogether)?

Darren Gardner
  • Thank you @darren for the full explanation you've added. – Monem_منعم Aug 19 '20 at 23:53
  • The file was indeed in ISO-8859-1, and I loaded it thinking it would preserve the encoding, but as you mentioned here it gets translated into UTF-8, which makes more sense now. But another file, which I found to be encoded as UTF-8, gave me invalid-UTF-8 errors; in Notepad++, using "show all characters", I found some bytes like xA0, so I had to load it using that same ISO-8859-1 encoding, and all rows in the file loaded. Does that mean they got translated to UTF-8 as well? – Monem_منعم Aug 20 '20 at 00:07
  • Which operating system are you using? If Linux, you can use ad-hoc techniques like the ones described here (https://stackoverflow.com/questions/48729215/how-to-check-character-encoding-of-a-file-in-linux) to "determine" (not 100%, but often very close) the encoding of a file if you do not know what encoding was used to produce it. Other operating systems have their own techniques for this (if you do not have Linux). I would suggest verifying the second file to see if it indeed is UTF-8 or some other encoding. – Darren Gardner Aug 20 '20 at 00:53
  • If it truly is UTF-8, and characters are not preserved during a load into Snowflake, I would recommend that you open a case with Snowflake Support, since that should NEVER happen. – Darren Gardner Aug 20 '20 at 00:53
  • I'm using Windows, and the file was produced on Windows as well, thanks for the tips. It was a quick, one-time load for our POC, so I did not spend time validating the file thoroughly, as I was focused on loading all rows. Next time I will probably verify the encoding and, as you suggested, open a case if I am sure it is UTF-8. – Monem_منعم Aug 20 '20 at 01:59
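For the encoding check discussed in the comments above, a portable alternative to the Linux `file`-style tools is a short Python heuristic (a rough sketch, not a definitive detector): bytes that decode as strict UTF-8 are almost certainly UTF-8, since accidental valid UTF-8 is rare, while ISO-8859-1 accepts every byte and so can only serve as a fallback label.

```python
# Rough, OS-independent encoding guess: try strict UTF-8 first;
# fall back to ISO-8859-1, which maps every possible byte.

def guess_encoding(data: bytes) -> str:
    """Very rough guess: strict UTF-8, else a single-byte fallback."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 always "succeeds", so this is a fallback label,
        # not a proven encoding.
        return "iso-8859-1 (or another single-byte encoding)"

print(guess_encoding("déjà vu".encode("utf-8")))       # utf-8
print(guess_encoding("déjà vu".encode("iso-8859-1")))  # the fallback label
```

Libraries such as `chardet` give more nuanced guesses, but for deciding between UTF-8 and a Latin-1-style export before a COPY INTO, a check like this is often enough.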

A detailed answer can be found here: [How to find rows with non-UTF8 characters in Snowflake](https://stackoverflow.com/questions/58987978/how-to-find-rows-with-non-utf8-characters-in-snowflake/58988864#58988864)

Please mark my question as a duplicate and refer to that link.

Monem_منعم