I have a file whose columns are delimited by ^A and rows delimited by '\n' new line character.
I first uploaded it to HDFS and then create the table in Hive using the command like this:
CREATE EXTERNAL TABLE
IF NOT EXISTS
html_sample
( ts string,
url string,
html string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
LOCATION '/tmp/directoryname/';
However, when I do a select statement for that table. It turned out to be a mess.
The table looks like this:
ts url html
10082013 http://url.com/01 <doctype>.....style="padding-top: 10px;
text-align... NULL NULL
text-align... NULL NULL
text-align... NULL NULL
10092013 http://url.com/02 <doctype>.....style="padding-top: 10px;
text-align... NULL NULL
text-align... NULL NULL
text-align... NULL NULL
Then I went back to the text file and found there exist several ^M characters in the file, which makes the HIVE treat that ^M as new line character.
When I first created the file, I intentionally removed all the new line character from the html to guarantee that each record is one line. However, I just cannot understand how on earth the HIVE could treat a ^M as a newline character. How can I get around that without modifying my file.
(I know it might be possible to do a global substitution in VI or sed... but it just doesn't make that much sense to me how could HIVE treat ^M as \n)