4

I have a file whose columns are delimited by ^A and rows delimited by '\n' new line character.

I first uploaded it to HDFS and then create the table in Hive using the command like this:

CREATE EXTERNAL TABLE 
IF NOT EXISTS 
html_sample 
( ts string, 
  url string, 
  html string) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\001' 
LINES TERMINATED BY '\n' 
LOCATION '/tmp/directoryname/';

However, when I do a select statement for that table. It turned out to be a mess.

The table looks like this:

ts              url                    html
10082013        http://url.com/01      <doctype>.....style="padding-top: 10px;
text-align...   NULL                   NULL 
text-align...   NULL                   NULL
text-align...   NULL                   NULL
10092013        http://url.com/02      <doctype>.....style="padding-top: 10px;
text-align...   NULL                   NULL 
text-align...   NULL                   NULL
text-align...   NULL                   NULL

Then I went back to the text file and found there exist several ^M characters in the file, which makes the HIVE treat that ^M as new line character.

When I first created the file, I intentionally removed all the new line character from the html to guarantee that each record is one line. However, I just cannot understand how on earth the HIVE could treat a ^M as a newline character. How can I get around that without modifying my file.

(I know it might be possible to do a global substitution in VI or sed... but it just doesn't make that much sense to me how could HIVE treat ^M as \n)

B.Mr.W.
  • 18,910
  • 35
  • 114
  • 178

1 Answers1

5

^M is a way in which Vim displays Windows line endings. Here's more on this: What does ^M character mean in Vim?

And Hive in its turn uses TextInputFormat which happens to treat it like a valid line terminator.

Depending on versions of Hadoop and Hive you're using there can be different ways to overcome this(from changing a property in config to custom InputFormat implementation).

Just find a way to specify separator explicitly.

And yeah, LINES TERMINATED BY '\n' does not do what it looks like. I'm using Hive 0.11 and only possible value is actually '\n' for it but it is not promoted to TextInputFormat

Community
  • 1
  • 1
ybodnar
  • 150
  • 4
  • 2
    And here's a ticket from Hadoop that addresses the problem with TextInputFormat. Maybe it will be helpful for you while searching for a concrete solution https://issues.apache.org/jira/browse/MAPREDUCE-2254 – ybodnar Oct 07 '13 at 02:52
  • 1
    thanks for the answer, I got around it by using sed -e 's/^M//g' to remove all the existing ^M in my file and that seems to work. But modify the configuration file is for sure a long-term solution. – B.Mr.W. Oct 07 '13 at 02:57