2

I am trying to load a csv with pipe delimiter to an hive external table. The pipe occurring within data fields are enclosed within quotes. Double quotes occurring within data are escaped with \ . When I configure external table, I see data with double quotes are not interpreted properly.

test.csv

id|name
105|"Test | pipe delim in field"
107|\" Test Escaped single double quote in HIVE
108|\" Test Escaped enclosed double quote in HIVE \"
109|\\" Test Escaped enclosed double quote in HIVE \"
110|\\" Test Escaped enclosed double quote in HIVE \\"

External table create statement

drop table test_schema.hive_test;
CREATE EXTERNAL TABLE test_schema.hive_test (id string, name string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES
(
"separatorChar" = "|",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
LOCATION '/staging/test/hive'
tblproperties ("skip.header.line.count"="1");

Output

+---------------+-------------------------------------------------+
| hive_test.id  |                 hive_test.name                  |
+---------------+-------------------------------------------------+
| 105           | Test | pipe delim in field                      |
| 107           | NULL                                            |
| 108           | NULL                                            |
| 109           | NULL                                            |
| 110           | " Test Escaped enclosed double quote in HIVE \  |
+---------------+-------------------------------------------------+

Expected Output

+---------------+-------------------------------------------------+
| hive_test.id  |       hive_test.name                            |
+---------------+-------------------------------------------------+
| 105           | Test | pipe delim in field                      |
| 107           | " Test Escaped single double quote in HIVE      |
| 108           | " Test Escaped enclosed double quote in HIVE "  |
| 109           | NULL                                            |
| 110           | NULL                                            |
+---------------+-------------------------------------------------+

Open CSV version 2.3

Naresh S
  • 765
  • 2
  • 10
  • 19
  • 2
    It seems to be a problem: https://stackoverflow.com/questions/49979309/character-slash-is-not-being-read-by-hive-on-using-opencsvserde – Chema Aug 06 '20 at 11:56
  • I tried with additional lines 109 & 110 with double backslash as suggested in one of the solutions shared by you . I see 110 shows up value but again the second double quote is not showing up properly. Update the question with the tries – Naresh S Aug 06 '20 at 14:25

1 Answers1

0

Sadly it is not possible to achieve because OpenCSV use single character as escape and really you are trying to use double backslash as escape character (which would be a string). In the OpenCSVSerde class you can find that no matter what you pass as escape character,OpenCSVSerde get the first character of your string value https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/OpenCSVSerde.java#L98

Here is the current code as reference

  private char getProperty(final Properties tbl, final String property, final char def) {
final String val = tbl.getProperty(property);

if (val != null) {
  return val.charAt(0);
}

return def;

}

I think is missing a warning to let the user know at the moment of the table creation that only single characters are supported.

hlagos
  • 7,690
  • 3
  • 23
  • 41