
I am facing a mind-boggling (to me) issue while trying to read ORC files. By default, Hive ORC files are in UTF-8 encoding, or at least are supposed to be. I am doing a copyToLocal of the ORC files and trying to read them in Java.

I am able to read the file successfully although it has some unwanted characters:

(screenshot: output showing the unwanted characters)

When I query the table in Hive, there are no unwanted characters:

(screenshot: clean output from the Hive query)

Can anyone please help? I have tried decoding and re-encoding between various charsets, such as ISO-8859-1 to UTF-8, UTF-8 to ISO-8859-1, and ISO-8859-1 to UTF-16.
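These attempts all followed the same general pattern, sketched below (the charset names and file path are only examples, not my exact code; the actual helper is convertEncoding in the code further down):

import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        // Read the raw bytes of the locally copied file (example path).
        byte[] raw = Files.readAllBytes(Paths.get("C:/path/target/opfile.txt"));

        // Decode with the charset the bytes are assumed to be in,
        // then re-encode with the target charset.
        String text = new String(raw, Charset.forName("ISO-8859-1"));
        byte[] converted = text.getBytes(Charset.forName("UTF-8"));

        Files.write(Paths.get("C:/path/target/opfile_converted.txt"), converted);
    }
}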

Edit:

I am using the Java code below to read the ORC file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.List;

// Utils.createFile / Utils.writeToFile are project-local helpers (not shown here).

public class OrcFormat {
    public static void main(String[] argv)
    {
        // Print the JVM's default encoding so it can be compared with UTF-8
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(Charset.defaultCharset().name());

        try {
            Configuration conf = new Configuration();
            Utils.createFile("C:/path/target","opfile.txt","UTF-8");
            Reader reader = OrcFile.createReader(new Path("C:/input/000000_0"),OrcFile.readerOptions(conf));

            StructObjectInspector inspector = (StructObjectInspector)reader.getObjectInspector();

            // Print any user metadata keys stored in the ORC file
            List<String> keys = reader.getMetadataKeys();
            for(int i=0;i<keys.size();i++){
                System.out.println("Key:"+keys.get(i)+",Value:"+reader.getMetadataValue(keys.get(i)));
            }


            RecordReader records = reader.rows();
            Object row = null;

            // Print the column types reported by the struct inspector
            List<? extends StructField> fields = inspector.getAllStructFieldRefs();
            for(int i = 0; i < fields.size(); ++i) {
                System.out.print(fields.get(i).getFieldObjectInspector().getTypeName() + '\t');
            }
            System.out.println();
            int rCnt=0;
            // Read rows one at a time and write each to the output file;
            // the early-exit check below stops after the first few rows.
            while(records.hasNext())
            {
                row = records.next(row);
                List value_lst = inspector.getStructFieldsDataAsList(row);
                String out = "";

                for(Object field : value_lst) {
                    if(field != null)
                        out+=field;
                    out+="\t";
                }
                rCnt++;

                out = out+"\n";
                byte[] outA = convertEncoding(out,"UTF-8","UTF-8");
                Utils.writeToFile(outA,"C:/path/target","opfile.txt","UTF-8");
                if(rCnt<10){
                    System.out.println(out);
                    System.out.println(new String(outA));
                }else{
                    break;
                }
            }
        }catch (Exception e)
        {
            e.printStackTrace();
        }
    }   

    // Decode the string's bytes (as produced by the platform default charset)
    // using inCharset, then re-encode them with outCharset.
    public static byte[] convertEncoding(String s,String inCharset,String outCharset){
        Charset inC = Charset.forName(inCharset);
        Charset outC = Charset.forName(outCharset);
        ByteBuffer inpBuffer = ByteBuffer.wrap(s.getBytes());
        CharBuffer data = inC.decode(inpBuffer);

        ByteBuffer opBuffer = outC.encode(data);
        byte[] opData = opBuffer.array();
        return opData;
    }
}
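For completeness, one way to take the JVM's default encoding out of the picture when writing the output would be an explicit UTF-8 writer, roughly like the sketch below (the output path is just an example; my actual writing goes through Utils.writeToFile):

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class ExplicitUtf8Write {
    public static void main(String[] args) throws Exception {
        // Wrap the stream in a writer with an explicit charset so the bytes
        // written do not depend on the JVM's file.encoding setting.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(
                        new FileOutputStream("C:/path/target/opfile_utf8.txt"),
                        StandardCharsets.UTF_8))) {
            writer.write("col1value\tcol2value\n");
        }
    }
}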
vhora
  • This seems like a Java issue and not a Hive issue. It doesn't seem like you are using UTF-8 in your Java application; instead you are using a single-byte encoding, therefore every other byte is '\0'. Please add your Java code. – David דודו Markovitz Feb 13 '17 at 08:58
  • I agree with that, but I need help on how to resolve it. I am using OrcReader to read the ORC file. I have added the Java code I am using. – vhora Feb 13 '17 at 09:14
  • BTW, your code might be O.K. - how do you actually look at the created file? A text editor? – David דודו Markovitz Feb 13 '17 at 09:19
  • Yes, Notepad++/TextPad. The issue is that this data is being fed to SQL Server, which does not support UTF-8, so I have to convert the encoding to ISO-8859-1, but then I am unable to get rid of the NUL (\0) characters. Any pointers? I could write a String.replaceAll("[^ -~\t\r\n]","") but I don't think that would be correct. – vhora Feb 13 '17 at 09:26
  • So this is actually not a post about Hive or Java but about loading UTF-8 text into SQL Server... Check http://stackoverflow.com/questions/12512687/sql-server-utf8-howto – David דודו Markovitz Feb 13 '17 at 09:32
  • Your file seems to be saved as UTF-16 (a fixed length of two bytes per character). – JosefZ Feb 13 '17 at 09:34
  • When I re-read the generated UTF-8 file with UTF-8 encoding I still get the \0 characters. I tried a UTF-16 to UTF-8 conversion, but it made no difference. When I open the file in Excel, it ignores the \0 characters and I can see valid data. So do applications ignore the \0 character in UTF-8 files? – vhora Feb 13 '17 at 09:46
  • And yes, this is an issue with multi-byte characters, as the legacy system was using UTF-16 and that data has been put into Hadoop (UTF-8). Is that causing the issue? If so, how can I read the data correctly? – vhora Feb 13 '17 at 09:48
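To illustrate the point raised in the comments about every other byte being '\0': if UTF-16-encoded bytes are decoded with a single-byte charset, each character's zero byte survives as a NUL. A small, self-contained check (hypothetical input string):

import java.nio.charset.StandardCharsets;

public class NulByteDemo {
    public static void main(String[] args) {
        // "AB" encoded as UTF-16LE is the byte sequence 41 00 42 00.
        byte[] utf16 = "AB".getBytes(StandardCharsets.UTF_16LE);

        // Decoding those bytes with a single-byte charset keeps the zero
        // bytes as NUL characters, giving "A\0B\0".
        String misread = new String(utf16, StandardCharsets.ISO_8859_1);
        System.out.println(misread.contains("\0")); // prints: true
    }
}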

0 Answers