52

I am trying to load a CSV file into a Hive table like so:

CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;    


The CSV is delimited by a comma (,) and looks like this:

1, "some text, with comma in it", 123, "more text"

This will return corrupt data since there is a ',' in the first string.
Is there a way to set a text delimiter or make Hive ignore the ',' inside quoted strings?

I can't change the delimiter of the csv since it gets pulled from an external source.

wrschneider
Martijn Lenderink

6 Answers

38

If you can re-create or parse your input data, you can specify an escape character in the CREATE TABLE:

ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';

This will accept the following line as 4 fields:

1,some text\, with comma in it,123,more text
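
A minimal sketch of the full DDL this implies, reusing the table and file names from the question and assuming the input file has been re-created with backslash-escaped commas as in the line above:

CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\';

-- assumes /data.csv now contains lines like: 1,some text\, with comma in it,123,more text
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;
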
libjack
  • That handles embedded commas, but not embedded newlines, which are the other gotcha in CSV data. Or can the newlines be escaped too? The spec at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable doesn't seem to allow escaping newlines. – Ken Williams Jul 23 '13 at 19:17
  • Thanks a lot, this is working for me. To re-create the csv file, I used the python csv writer with the following dialect: csv.register_dialect('for_hive', escapechar='\\', quoting=csv.QUOTE_NONE) – John Prawyn Dec 10 '20 at 17:57
36

The problem is that Hive doesn't handle quoted text. You either need to pre-process the data by changing the delimiter between the fields (e.g. with a Hadoop streaming job), or you can try a custom CSV SerDe that uses OpenCSV to parse the files.
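
For the SerDe route, a rough sketch of what that could look like. The jar path is hypothetical, and com.bizo.hive.serde.csv.CSVSerde is the class name used by one of the OpenCSV-based csv-serde projects, so verify both against whichever SerDe you actually pick; note that such CSV SerDes generally expose every column as a string:

-- hypothetical path to the downloaded/built SerDe jar
ADD JAR /path/to/csv-serde.jar;

CREATE TABLE mytable_csv
(
num1 STRING,
text1 STRING,
num2 STRING,
text2 STRING
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable_csv;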

Lorand Bendig
  • sed -i 's/"//g' your_file_name does the pre-processing in place by removing the quote characters. However, you NEED to be certain that this doesn't also strip other intended quote (") characters. – ekta May 22 '14 at 11:18
28

As of Hive 0.14, the CSV SerDe is a standard part of the Hive install

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)
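
A minimal sketch applying this to the question's table. The separatorChar/quoteChar SERDEPROPERTIES are the ones documented on that wiki page; note that OpenCSVSerde exposes every column as STRING, so numeric fields need a CAST at query time:

CREATE TABLE mytable
(
num1 STRING,
text1 STRING,
num2 STRING,
text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\""
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;

-- numeric fields come back as strings, e.g.:
-- SELECT CAST(num1 AS INT), text1 FROM mytable;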

wrschneider
  • If your HIVE is up-to-date, this is the best answer :) – bartektartanus Apr 15 '16 at 08:20
  • This helped me too! – Kulasangar Jun 30 '17 at 05:26
  • When you use OpenCSVSerde is there a way to specify what Null is defined with? Using "ROW FORMAT DELIMITED" I could add the option "NULL DEFINED AS ' '" to recognize null values in the data. – JeffR Sep 13 '17 at 15:56
  • This is not working for me, Hive shows the quoted value as NULL – Gocht Nov 24 '17 at 22:42
  • @wrschneider, where can i download this serde ? – Aditya Sep 27 '19 at 09:54
  • @Aditya it should come out of the box now and you should no longer have to download it – wrschneider Oct 03 '19 at 13:17
  • @wrschneider thanks for your response. I could find it now. However, this does not support multiline values in a column (a multiline address, for example). The only solution seems to be replacing the \n within quotes using Scala or Python. Any suggestion here? – Aditya Oct 07 '19 at 09:57
0

Keep the delimiter in single quotes and it will work:

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

suyash
0

Add a backslash in FIELDS TERMINATED BY '\;'

For Example:

CREATE  TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS 
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag 
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;

I have tested it, and it worked.

Mantej Singh
  • It's working because `'\;'` is the same thing as `';'`. There is no need to escape the semicolon, but there is no harm in it either. – Nas Banov Oct 09 '19 at 06:49
0

The org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe worked for me. My delimiter was '|' and one of the columns was enclosed in double quotes.

Query:

CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, DATE STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
     "separatorChar" = "|",
     "quoteChar"     = "\"",
     "escapeChar"    = "\""
)
STORED AS TEXTFILE LOCATION '/user/abc/csv_folder';
Oladipo