
I am selecting column data in Hive using the substring function with a length of 3999. I am storing the selected columns in a blob file on Azure and then loading the file into Azure Data Warehouse using Azure Data Factory. I am now encountering errors for a few rows, saying the data length has exceeded 3999 (MaxLength).

For troubleshooting, I took a substring of length 2000 in Hive and saved the data to a file. This time I did not receive any errors. However, when I checked the column in the Data Warehouse, some of the values have a length of more than 2000. This mostly happens for data containing Chinese characters.
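For reference, the export step looks roughly like the sketch below (the table and column names `src_table` and `comments_col` are hypothetical; the actual query is not shown here):

```sql
-- Hypothetical export query: truncate a wide string column to 3999
-- *characters* before writing the rows out for Azure Data Factory.
INSERT OVERWRITE DIRECTORY '/staging/export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT
  id,
  substr(comments_col, 1, 3999) AS comments_trunc
FROM src_table;
```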

shubham

1 Answer


This is because some characters (like Chinese characters) are encoded as several bytes in many popular encodings, such as UTF-8. Your length test before pushing to the data factory counts the number of characters, while the data factory itself counts the number of bytes (one byte per ASCII character, but three or even four bytes per Chinese character).
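You can see the mismatch directly in Hive. A minimal sketch, assuming Hive 2.2.0 or later, where the built-ins `char_length()` and `octet_length()` are available:

```sql
-- Each of these Chinese characters is 1 character but 3 bytes in UTF-8.
SELECT
  char_length('中文数据')  AS num_chars,   -- 4
  octet_length('中文数据') AS num_bytes;   -- 12
```

So `substr(col, 1, 2000)` keeps 2000 characters, which can still be up to 6000 bytes once written out in UTF-8.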

Change the encoding, or measure the length of your string in bytes before pushing.
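If you have to stay in Hive, a blunt but workable approach is to re-truncate only the rows whose byte length still exceeds the limit, using a character count that is guaranteed to fit. This is only a sketch, assuming a 3999-byte limit, the hypothetical names `src_table`/`comments_col` from above, at most 3 bytes per character (true for Chinese characters in the Basic Multilingual Plane), and Hive 2.2.0+ for `octet_length()`:

```sql
SELECT
  CASE
    WHEN octet_length(comments_col) <= 3999 THEN comments_col
    -- Worst case 3 bytes/char: 1333 characters is at most 3999 bytes.
    -- Drop to 999 if 4-byte characters (e.g. emoji) can occur.
    ELSE substr(comments_col, 1, 1333)
  END AS comments_trunc
FROM src_table;
```

This wastes some headroom on mixed ASCII/Chinese rows; trimming to exactly 3999 bytes would need a UDF.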

Nino Filiu
  • How do I proceed with selecting only the required number of bytes from Hive? I only know of the substring function. I cannot change the encoding of the tables. – shubham Jan 10 '19 at 11:20
  • Check this answer on [how to calculate the byte length of a string](https://stackoverflow.com/questions/25994001/how-to-calculate-byte-length-containing-utf8-characters-using-javascript). Finding the best 3999-byte-long substring is easy from there ;) – Nino Filiu Jan 10 '19 at 12:41