1

When I'm trying to load a file to Hive, some of the characters are not getting interpreted properly and are coming up as boxes in hive.

Data in hive

Using the "file" command, I get the file type as below:

ksh> file -bi data.TXT
text/plain; charset=us-ascii

Below is a data sample (opened in vim editor in a Putty terminal linux machine having Settings: Window>Translation>Remote character Set as UTF-8):

338498281INT057575238000ZENECA 176334                                                          9
338498282INT015250238000<8d>4CYANO3(TRIFLUOROMETHYL)PHENYLÙ3<8d>4FLUOROPHENYL)SULFONYLHYDROXYMETHPRO 9

Below is a snapshot of the same (Notice the Blue color of <8d> character in 2nd row, which I want to remove, and keep the Ù : U with Gravel )

Data Sample I've tried removing it using sed, tr commands, but still the data comes as boxes.

cat data.TXT | tr -d '\215'  > data_tr_215.TXT  # Ù still coming as boxes in hive 

I've also tried to convert it using Java's InputStreamReader and OutputStreamWriter , but even that gives the result as boxes : 000�4CYANO3(TRIFLUOROMETHYL)PHENYL�3�4FLUOROPHENYL)

scala> val fsDataInputStream = hdfs.open(new Path("data.TXT"))
fsDataInputStream: org.apache.hadoop.fs.FSDataInputStream = org.apache.hadoop.hdfs.client.HdfsDataInputStream@7789f96c: org.apache.hadoop.crypto.CryptoInputStream@b274834

scala> val isr = new InputStreamReader(fsDataInputStream)
isr: java.io.InputStreamReader = java.io.InputStreamReader@4e8b0758

scala> isr.getEncoding
res1: String = UTF8

How can I get rid of these charactes and still load the data properly having Ù ?

Balkrishan Aggarwal
  • 613
  • 1
  • 6
  • 20
  • 1
    it seems that it isn't really US-ASCII format https://stackoverflow.com/questions/11303405/force-encode-from-us-ascii-to-utf-8-iconv it's just an educated guess. What is the source of your input files? – Tschallacka Feb 20 '18 at 13:40
  • The presence of `<8d>` means your file is neither UTF-8 nor ASCII. Try the `"windows-1252"` encoding. – VGR Feb 20 '18 at 14:57
  • @Tschallacka Yes, it looks like it's not really 7-bit us-ascii as it says. It looks more like ISO-8859-1. mattpr 's answer https://stackoverflow.com/a/36030982/4604453 helped to identify it better. Trying out the solution. – Balkrishan Aggarwal Feb 21 '18 at 14:03

0 Answers0