1

I am trying to load a csv with pipe delimiter to an hive external table. The data values contain single quote, double quotes, brackets etc.. Using Open CSV version 2.3

testfile.csv

id|name|phone
1|Rahul|123
2|Kumar's|456
3|Neetu"s|789
4|Ravi [Roma]|234

Table created as -

drop table test_schema.hive_test;
CREATE EXTERNAL TABLE test_schema.hive_test (id string, name string, phone string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES
(
'separatorChar' = '|',
'quoteChar' = '[\'\"]',
'escapeChar' = '\\'
)
LOCATION '/staging/test/hive'
tblproperties ("skip.header.line.count"="1");

Output -

+-------------+---------------+----------------+
| hive_test.id|hive_test.name |hive_test.phone |
+-------------+---------------+----------------+
| 1           | Rahul         |123             |
| 2           | Kumar's       |456             |
| 3           | Neetu"s       |789             |
| 4           | NULL          |234             |
+---------------+------------------------------+

Expected Output -

+-------------+---------------+----------------+
| hive_test.id|hive_test.name |hive_test.phone |
+-------------+---------------+----------------+
| 1           | Rahul         |123             |
| 2           | Kumar's       |456             |
| 3           | Neetu"s       |789             |
| 4           | Ravi [Roma]   |234             |
+---------------+------------------------------+

The problem is we have multiple escape sequence/quote character in data and hence we need to include all of them.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Khilesh Chauhan
  • 739
  • 1
  • 10
  • 36
  • testfile.csv contains unquoted strings. it should work with delimiter only(separator), try to set 'quoteChar' = '' (empty) – leftjoin Jun 22 '21 at 20:49
  • 'quoteChar' should not be empty. Tried to create table with empty quote character and it threw error "String index out of range : 0 " – Khilesh Chauhan Jun 23 '21 at 05:51

1 Answers1

1

Use LazySimpleSerDe (STORED AS TEXTFILE):

CREATE EXTERNAL TABLE test_schema.hive_test (id string, name string, phone string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
ESCAPED BY '\\' 
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/staging/test/hive_test'
tblproperties ("skip.header.line.count"="1");
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thanks, this works great. Only issue is when I remove the files from LOCATION, the table still returns data when i do: select * from hive_test. I actually need to truncate hive_test to delete data. Why is this? – Victor Jun 24 '21 at 15:56
  • @Victor location should be table specific, not just /staging/test/hive, i will fix the answer, it should look like /staging/test/hive_test. Check your table location using DESCRIBE FORMATTED tablename. If you delete files from table location, it should not return data. Just check it is correct location. If you drop location as well, select should fail with location not found. Use hadoop fs -ls 'location/path/' to check files exist – leftjoin Jun 24 '21 at 16:04