Creating hive table using parquet file metadata

Question

I wrote a DataFrame as a Parquet file, and I would like to read that file from Hive using the metadata stored in the Parquet file.

Output from the Parquet write

_common_metadata  part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  _SUCCESS
_metadata         part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet

Hive table

CREATE  TABLE testhive
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/home/gz_files/result';



FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified

How can I infer the metadata from the Parquet file?

If I open _common_metadata, I see the following content:

PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}

Or how can I parse the metadata file?

Did you try with the newer hive syntax? https://cwiki.apache.org/confluence/display/Hive/Parquet — Reactormonk, Nov 10 '15 at 11:03
It works if I add column names. But, parquet has schema in meta info. — WoodChopper, Nov 10 '15 at 12:13

score 12 · Accepted Answer · answered Jul 13 '16 at 03:27


Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.

First, start a spark-shell (or compile it all into a JAR and run it with spark-submit, but the shell is so much easier):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame


val df=sqlContext.parquetFile("/path/to/_common_metadata")

def creatingTableDDL(tableName:String, df:DataFrame): String={
  val cols = df.dtypes
  var ddl1 = "CREATE EXTERNAL TABLE "+tableName + " ("
  //looks at the datatypes and columns names and puts them into a string
  val colCreate = (for (c <-cols) yield(c._1+" "+c._2.replace("Type",""))).mkString(", ")
  ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
  ddl1
}

val test_tableDDL=creatingTableDDL("test_table",df,"test_db")

This gives you the data types that Hive will use for each column, as they are stored in Parquet. For example: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'

answered Jul 13 '16 at 03:27

James Tobin


Just learned this today while testing: if the datatype it finds is "integer", Spark reads it as "IntegerType", so when I strip the "Type" it becomes "Integer". Hive doesn't like "integer", so you'll have to change that in the DDL to "int", but that's a small change you can work out on your own =) – James Tobin Jul 13 '16 at 18:56
Very nice, that should be added to Spark SQL! – Thomas Decaux Dec 18 '17 at 08:36
Are you sure that the code compiles? I mean, `creatingTableDDL("test_table",df,"test_db")` has 3 arguments, but the method definition only 2. What is the purpose of the argument `"test_db"`? – UninformedUser Feb 26 '18 at 18:22
that was c/p from a local copy where the db could be specified as well; so, no, if you c/p the above and run the example without any changes, it would not work, but `creatingTableDDL("test_table",df)` would – James Tobin Mar 08 '18 at 19:22
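
The integer-to-int fix mentioned in these comments could be folded into the string-replacement approach with a small lookup table. The following is only a sketch and not part of the original answer: the object name `HiveTypeNames`, the helper names, and the entries of the override map are assumptions made for illustration. It takes (name, type) pairs in the shape that `df.dtypes` returns, so it runs without a Spark session.

```scala
object HiveTypeNames {
  // Hypothetical override table: Spark type names (with "Type" removed,
  // lowercased) whose Hive spelling differs. Extend it for your own data.
  private val overrides: Map[String, String] = Map(
    "integer" -> "int",
    "long"    -> "bigint",
    "short"   -> "smallint",
    "byte"    -> "tinyint"
  )

  // Turns a Spark type name such as "IntegerType" into a Hive type name.
  // Names without an override are passed through with "Type" removed,
  // which mirrors the replace("Type", "") step in the answer above.
  def toHiveType(sparkType: String): String = {
    val base = sparkType.replace("Type", "")
    overrides.getOrElse(base.toLowerCase, base)
  }

  // Builds the column list of a DDL statement from (name, type) pairs.
  def columnList(dtypes: Seq[(String, String)]): String =
    dtypes.map { case (name, tpe) => s"$name ${toHiveType(tpe)}" }.mkString(", ")
}
```

In a spark-shell this could be called as `HiveTypeNames.columnList(df.dtypes.toSeq)` and spliced into the DDL string in place of `colCreate`.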

score 11 · Answer 2 · answered Sep 28 '16 at 08:48

I'd just like to expand on James Tobin's answer. There's a StructField class which provides Hive's data types without doing string replacements.

// Tested on Spark 1.6.0.

import org.apache.spark.sql.DataFrame

def dataFrameToDDL(dataFrame: DataFrame, tableName: String): String = {
    val columns = dataFrame.schema.map { field =>
        "  " + field.name + " " + field.dataType.simpleString.toUpperCase
    }

    s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}

This solves the IntegerType problem.

scala> val dataFrame = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("x", "y")
dataFrame: org.apache.spark.sql.DataFrame = [x: int, y: string]

scala> print(dataFrameToDDL(dataFrame, "t"))
CREATE TABLE t (
  x INT,
  y STRING
)

This should work with any DataFrame, not just with Parquet. (e.g., I'm using this with a JDBC DataFrame.)

As an added bonus, if your target DDL supports nullable columns, you can extend the function by checking StructField.nullable.
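
As a rough sketch of that extension (an assumption, not code from this answer), the variant below appends NOT NULL for fields whose `nullable` flag is false. The names `NullableDDL`, `schemaToDDL`, and `exampleSchema` are hypothetical, and whether your target database accepts a bare NOT NULL clause is something to check for your version. It works on a `StructType`, so it only needs the Spark SQL types on the classpath, not a running session.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object NullableDDL {
  // Same idea as dataFrameToDDL, but driven by the schema alone and with a
  // NOT NULL suffix for non-nullable fields (assumed to be valid in the target DDL).
  def schemaToDDL(schema: StructType, tableName: String): String = {
    val columns = schema.fields.map { field =>
      val constraint = if (field.nullable) "" else " NOT NULL"
      "  " + field.name + " " + field.dataType.simpleString.toUpperCase + constraint
    }
    s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
  }

  // Hypothetical schema used only to illustrate the call.
  val exampleSchema: StructType = StructType(Seq(
    StructField("x", IntegerType, nullable = false),
    StructField("y", StringType, nullable = true)
  ))
}
```

With a DataFrame at hand, the call would be `NullableDDL.schemaToDDL(dataFrame.schema, "t")`.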

This is a much more 'real' answer than the one I provided. Going with this one myself. — James Tobin, Oct 10 '16 at 12:34
How about if table is partitioned?? I have a database where some of the tables are partitioned and some are not. — Faisal Ahmed Siddiqui, Oct 29 '18 at 18:31

score 2 · Answer 3 · answered Aug 20 '19 at 12:44

I would like to expand on James's answer.

The following code will work for all datatypes including ARRAY, MAP and STRUCT.

Tested on Spark 2.2.

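// Read the Parquet data (sqlContext.parquetFile was removed in Spark 2.0).
val df = spark.read.parquet("parquetFilePath")
// Placeholder: the original snippet left tableName undefined; set your own.
val tableName = "test_table"
val columns = df.schema.fields
// dataType.sql renders each type as a SQL type string, including nested types.
val cols = columns.map(column => column.name + " " + column.dataType.sql).mkString(", ")
val ddl1 = "CREATE EXTERNAL TABLE " + tableName + " (" + cols + ") STORED AS PARQUET LOCATION '/tmp/hive_test1/'"
spark.sql(ddl1)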

Tagar · Answer 4 · 2022-05-16T05:21:18.687

1

Actually, Impala supports

CREATE TABLE LIKE PARQUET

(no columns section altogether):

https://docs.cloudera.com/runtime/7.2.15/impala-sql-reference/topics/impala-create-table.html

Your question is tagged "hive" and "spark", and I don't see this implemented in Hive, but if you use CDH, it may be what you are looking for.

edited May 16 '22 at 05:21

answered Nov 27 '15 at 05:05

Tagar


Updated (working) link: https://docs.cloudera.com/runtime/7.2.15/impala-sql-reference/topics/impala-create-table.html – Ryan Jendoubi May 16 '22 at 02:28

score 1 · Answer 5 · answered Oct 17 '16 at 13:25

A small improvement on Victor's answer: it quotes field.name with backticks and binds the table to a local Parquet file (tested on Spark 1.6.1).

import org.apache.spark.sql.DataFrame

def dataFrameToDDL(dataFrame: DataFrame, tableName: String, absFilePath: String): String = {
    // Backtick-quote each column name so that reserved words and odd characters are accepted.
    val columns = dataFrame.schema.map { field =>
        "  `" + field.name + "` " + field.dataType.simpleString.toUpperCase
    }
    s"CREATE EXTERNAL TABLE $tableName (\n${columns.mkString(",\n")}\n) STORED AS PARQUET LOCATION '$absFilePath'"
}

Also note that:

A HiveContext is needed, since SQLContext does not support creating external tables.
The path to the Parquet folder must be an absolute path.

Tagar · Answer 6 · 2015-11-27T06:15:31.223

I had the same question. It might be hard to implement in practice, though, because Parquet supports schema evolution:

http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1

For example, you could add a new column to your table without touching the data that is already in the table. Only the new data files will have the new metadata (compatible with the previous version).

Schema merging has been switched off by default since Spark 1.5.0 because it is a "relatively expensive operation" (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), so inferring the most recent schema may not be as simple as it sounds. Quick-and-dirty approaches are quite possible, though, e.g. by parsing the output of

$ parquet-tools schema /home/gz_files/result/000000_0
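
One way to sketch that quick-and-dirty approach, assuming the `parquet-tools schema` output is a flat `message` block with one primitive field per line, such as `optional binary TSN (UTF8);`. The object name, the type table, and the regular expression are illustrative assumptions. Nested groups, and any primitive type missing from the table (for example int96), are silently skipped, so treat the result as a starting point to review rather than a complete converter.

```scala
object ParquetToolsSchema {
  // Assumed mapping from Parquet primitive types to Hive types.
  private val primitiveToHive: Map[String, String] = Map(
    "boolean" -> "BOOLEAN",
    "int32"   -> "INT",
    "int64"   -> "BIGINT",
    "float"   -> "FLOAT",
    "double"  -> "DOUBLE",
    "binary"  -> "BINARY"
  )

  // Matches lines such as "optional binary TSN (UTF8);" or "required int32 id;".
  private val fieldLine =
    """\s*(?:optional|required)\s+(\w+)\s+(\w+)\s*(?:\((\w+)\))?\s*;\s*""".r

  // Extracts (columnName, hiveType) pairs from the schema text.
  def columns(schemaText: String): List[(String, String)] =
    schemaText.split("\n").toList.collect {
      case fieldLine(primitive, name, annotation) if primitiveToHive.contains(primitive) =>
        // A binary field annotated as UTF8 holds text, so map it to STRING.
        val hiveType =
          if (primitive == "binary" && annotation == "UTF8") "STRING"
          else primitiveToHive(primitive)
        (name, hiveType)
    }

  // Assembles an external-table DDL statement from the parsed columns.
  def toDDL(tableName: String, location: String, schemaText: String): String = {
    val cols = columns(schemaText).map { case (n, t) => s"`$n` $t" }.mkString(", ")
    s"CREATE EXTERNAL TABLE $tableName ($cols) STORED AS PARQUET LOCATION '$location'"
  }
}
```

Because this reads a single file's schema, it inherits the caveat above: with schema evolution, other files in the same directory may carry additional columns.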
