
From Hadoop: The Definitive Guide:

There are two dimensions that govern table storage in Hive: the row format and the file format.

The row format dictates how rows, and the fields in a particular row, are stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word for a Serializer-Deserializer. When acting as a deserializer, which is the case when querying a table, a SerDe will deserialize a row of data from the bytes in the file to objects used internally by Hive to operate on that row of data. When used as a serializer, which is the case when performing an INSERT or CTAS (see “Importing Data” on page 500), the table’s SerDe will serialize Hive’s internal representation of a row of data into the bytes that are written to the output file.

The file format dictates the container format for fields in a row. The simplest format is a plain-text file, but there are row-oriented and column-oriented binary formats available, too.

What does "the container format for fields in a row" mean for a file format?

How is a file format different from a row format?

Tim

1 Answer


See also the Hive developer guide about SerDe.

Hive uses SerDe (and FileFormat) to read and write table rows.

HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

You can create tables with a custom SerDe or with a native SerDe. A native SerDe is used if ROW FORMAT is not specified, or if ROW FORMAT DELIMITED is specified.
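For example, a table using the native delimited SerDe might be declared like this (the table name, columns, and delimiter here are made up for illustration):

```sql
-- Hypothetical table using the native row format:
-- ROW FORMAT DELIMITED selects the built-in LazySimpleSerDe
CREATE TABLE logs (
  ts    STRING,
  level STRING,
  msg   STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```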

File format represents the file container; it can be text or a binary format like ORC or Parquet.

Row format can be simple delimited text, or something more complex such as regexp/template-based parsing or JSON, for example.

Consider JSON formatted records in a Text file:

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE

Or JSON records in a sequence file:

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS SEQUENCEFILE
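Putting either fragment into a complete statement, a hypothetical table of JSON records stored as plain text could be declared as follows (the table name and columns are invented for illustration):

```sql
-- Hypothetical table: one JSON document per line in a plain-text file.
-- The SerDe parses each line (row format); TEXTFILE is the container (file format).
CREATE TABLE events_json (
  id   STRING,
  type STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```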

Everything here is actually a Java class. What is very confusing for beginners is that the DDL allows shortcuts, so you can write DDL without specifying long and complex class names for every format. Some classes have no corresponding shortcut in the DDL language.

STORED AS SEQUENCEFILE is a shortcut for

STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'

These two classes determine how to read and write the file container.

And this class determines how each row is stored and read (JSON in this case):

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'

And now the DDL with row format and file format, without shortcuts:

ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'

For an even better understanding of the difference, look at the SequenceFileOutputFormat class (which extends FileOutputFormat) and the JsonSerDe class (which implements SerDe). You can dig deeper and study the implemented methods and the base classes/interfaces; in particular, look at the source code of the serialize and deserialize methods in the JsonSerDe class.

And "the container format for fields in a row" is FileInputFormat plus FileOutputFormat mentioned in the above DDLs. In case of ORC file for example, you cannot specify row format (delimited or other SerDe). ORC file dictates that OrcSerDe will be used only for this type of file container, which has it's own internal format for storing rows and columns. Actually you can write ROW FORMAT DELIMITED STORED AS ORC in Hive, but row format delimited will be ignored in this case.

leftjoin