TextInputFormat vs HiveIgnoreKeyTextOutputFormat

Question

I'm just starting out with Hive, and I have a question about Input/Output Format. I'm using the OpenCSVSerde serde, but I don't understand why for text files the Input format is org.apache.hadoop.mapred.TextInputFormat but the output format is org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.

I've read this question, but it's still not clear to me why the input and output formats are different. Isn't that basically saying you're going to store data added to this table differently from the data that's read from the table?

Anyway, any help would be appreciated

score 2 · Answer 1 · answered Mar 18 '19 at 03:35

In TextInputFormat, each key is the position of a line in the file (its byte offset, a long), and each value is the line of text. When a program reads a file, it might use the keys for random reads, whereas when writing text data through HiveIgnoreKeyTextOutputFormat there is no value in maintaining the position, because it has no meaning in the output.
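To make the key/value shape concrete, here is a small Python model of what a line-oriented reader hands back. It is only a sketch of the idea, not Hadoop's Java code: `read_records` is a hypothetical name, and the sample CSV bytes are made up for illustration.

```python
import io


def read_records(stream):
    """Yield (byte_offset, line) pairs from a binary stream.

    Hypothetical helper that imitates the record shape described above:
    the key is the offset where the line starts, the value is the line text.
    """
    offset = 0
    for raw in stream:
        # Strip the line terminator from the value, but count it in the offset.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


# Made-up sample data: three CSV lines.
data = io.BytesIO(b"id,name\n1,alice\n2,bob\n")
records = list(read_records(data))
for key, value in records:
    print(key, repr(value))
```

The keys here only say where each line started in this particular file, which is why they are of no use as table data.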

Hence, HiveIgnoreKeyTextOutputFormat passes the key as null to the underlying RecordWriter. When the RecordWriter receives a null key, it ignores the key and writes just the value followed by a line separator. Otherwise, the RecordWriter would write the key, then a delimiter, then the value, and finally a line separator.
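The difference is easy to see in a simplified Python model of the writer side. This is a sketch under assumptions, not the actual Hadoop or Hive implementation: `write_record` and `write_ignoring_key` are hypothetical names, and the tab separator is assumed to match the default of Hadoop's text output.

```python
import io


def write_record(out, key, value, separator="\t"):
    """Model of a line-oriented record writer (hypothetical, simplified).

    Writes "key<separator>value\n", leaving out whichever part is None.
    """
    if key is None and value is None:
        return  # nothing to write
    if key is not None:
        out.write(str(key))
    if key is not None and value is not None:
        out.write(separator)
    if value is not None:
        out.write(str(value))
    out.write("\n")


def write_ignoring_key(out, key, value):
    """Model of the 'ignore key' wrapper: always forward None as the key."""
    write_record(out, None, value)


# Same record written both ways; 8 stands in for a byte-offset key.
with_key = io.StringIO()
write_record(with_key, 8, "1,alice")

without_key = io.StringIO()
write_ignoring_key(without_key, 8, "1,alice")

print(repr(with_key.getvalue()))
print(repr(without_key.getvalue()))
```

If the key were kept, every row written to the table would gain a leading offset column, and reading the file back would no longer return the original rows. Dropping the key on write is what keeps the file on disk in the same plain line format that TextInputFormat reads.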
