
I am selecting column data in Hive using the substring function with a length of 3999. I am storing the selected columns in a blob file on Azure and then loading the file into Azure Data Warehouse using Azure Data Factory. I am now encountering errors for a few rows, saying the data length has exceeded 3999 (MaxLength).

For troubleshooting, I took a substring of length 2000 in Hive and saved the data to a file. This time I did not receive any errors. However, when I checked the column in the Data Warehouse, some of the values have a length of more than 2000. This mostly happens for data containing Chinese characters.
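For reference, the export step looks roughly like the sketch below (the table and column names `src_table` and `comments_col` are hypothetical; the actual query is not shown here):

```sql
-- Hypothetical export query: truncate a wide string column to 3999
-- *characters* before writing the rows out for Azure Data Factory.
INSERT OVERWRITE DIRECTORY '/staging/export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT
  id,
  substr(comments_col, 1, 3999) AS comments_trunc
FROM src_table;
```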

shubham

1 Answer


This is because some characters (like Chinese characters) are encoded as several bytes in many popular encodings, such as UTF-8. Your length test before pushing to the data factory counts the number of characters, while the data factory itself counts the number of bytes (one byte per ASCII character, but three or even four bytes per Chinese character).
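You can see the mismatch directly in Hive. A minimal sketch, assuming Hive 2.2.0 or later, where the built-ins `char_length()` and `octet_length()` are available:

```sql
-- Each of these Chinese characters is 1 character but 3 bytes in UTF-8.
SELECT
  char_length('中文数据')  AS num_chars,   -- 4
  octet_length('中文数据') AS num_bytes;   -- 12
```

So `substr(col, 1, 2000)` keeps 2000 characters, which can still be up to 6000 bytes once written out in UTF-8.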

Change the encoding, or measure the length of your string in bytes before pushing.
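If you have to stay in Hive, a blunt but workable approach is to re-truncate only the rows whose byte length still exceeds the limit, using a character count that is guaranteed to fit. This is only a sketch, assuming a 3999-byte limit, the hypothetical names `src_table`/`comments_col` from above, at most 3 bytes per character (true for Chinese characters in the Basic Multilingual Plane), and Hive 2.2.0+ for `octet_length()`:

```sql
SELECT
  CASE
    WHEN octet_length(comments_col) <= 3999 THEN comments_col
    -- Worst case 3 bytes/char: 1333 characters is at most 3999 bytes.
    -- Drop to 999 if 4-byte characters (e.g. emoji) can occur.
    ELSE substr(comments_col, 1, 1333)
  END AS comments_trunc
FROM src_table;
```

This wastes some headroom on mixed ASCII/Chinese rows; trimming to exactly 3999 bytes would need a UDF.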

Nino Filiu
  • How do I proceed with selecting only the required number of bytes from Hive? I only know of the substring function. I cannot change the encoding of the tables. – shubham Jan 10 '19 at 11:20
  • Check this answer on [how to calculate the byte length of a string](https://stackoverflow.com/questions/25994001/how-to-calculate-byte-length-containing-utf8-characters-using-javascript). Finding the best 3999-byte-long substring is easy from there ;) – Nino Filiu Jan 10 '19 at 12:41