When I try to load a file into Hive, some of the characters are not interpreted properly and show up as boxes in Hive.
Using the "file" command, I get the file type as below:
ksh> file -bi data.TXT
text/plain; charset=us-ascii
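For reference, vim's `<8d>` is its rendering of the raw byte 0x8D, which is not valid UTF-8 on its own. The exact byte values can be checked with `od` (a minimal sketch on a two-byte sample, since I can't paste the raw bytes here):

```shell
# Write the two suspect bytes: 0x8D (what vim shows as <8d>) and 0xD9 (Ù in Latin-1)
printf '\215\331' > sample.bin
# Dump as octal bytes so the values are unambiguous
od -b sample.bin
```

On the real file, the same dump (piped through `grep 215`) confirms where the stray bytes sit.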
Below is a data sample (opened in vim on a Linux machine via a PuTTY terminal, with Window > Translation > Remote character set set to UTF-8):
338498281INT057575238000ZENECA 176334 9
338498282INT015250238000<8d>4CYANO3(TRIFLUOROMETHYL)PHENYLÙ3<8d>4FLUOROPHENYL)SULFONYLHYDROXYMETHPRO 9
Below is a snapshot of the same (notice the blue color of the <8d> character in the 2nd row, which I want to remove while keeping the Ù: U with grave accent).
I've tried removing it using the "sed" and "tr" commands, but the data still comes up as boxes in Hive.
tr -d '\215' < data.TXT > data_tr_215.TXT # Ù still coming up as boxes in Hive
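My suspicion is that the file is really Latin-1/Windows-1252 (where Ù is the single byte 0xD9, invalid on its own in UTF-8, hence the boxes), so deleting 0x8D alone can't fix the Ù. What I'm considering is stripping the stray byte and then re-encoding the whole file, roughly like this (the ISO-8859-1 source encoding is my assumption, not verified; the sample line here just mimics the data):

```shell
# Fake one line of the data: \331 is Ù in Latin-1, \215 is the stray 0x8D byte
printf 'PHENYL\331PHENYL\215END\n' > data.TXT
# Drop the stray 0x8D, then re-encode Latin-1 -> UTF-8 (source charset is an assumption)
tr -d '\215' < data.TXT | iconv -f ISO-8859-1 -t UTF-8 > data_utf8.TXT
cat data_utf8.TXT   # prints PHENYLÙPHENYLEND
```

Would this be the right approach, or is there a way to tell Hive the encoding directly?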
I've also tried converting it using Java's InputStreamReader and OutputStreamWriter, but even that renders the result as boxes: 000�4CYANO3(TRIFLUOROMETHYL)PHENYL�3�4FLUOROPHENYL)
scala> val fsDataInputStream = hdfs.open(new Path("data.TXT"))
fsDataInputStream: org.apache.hadoop.fs.FSDataInputStream = org.apache.hadoop.hdfs.client.HdfsDataInputStream@7789f96c: org.apache.hadoop.crypto.CryptoInputStream@b274834
scala> val isr = new InputStreamReader(fsDataInputStream)
isr: java.io.InputStreamReader = java.io.InputStreamReader@4e8b0758
scala> isr.getEncoding
res1: String = UTF8
How can I get rid of these characters and still load the data properly, keeping the Ù?