I am not sure I follow the Databricks/Delta.io terminology here, but either way this may help your understanding. In Delta Lake, the metadata is part of the data: a delta table consists of the data files (parquet) plus the metadata (the DeltaLog, i.e. the `_delta_log` directory within the table directory). A delta table directory therefore usually looks something like the following (for each file pattern shown there can be many files; we also ignore some details like checkpoint files):
- tablename/
  - part-*.snappy.parquet
  - _delta_log/*.crc
  - _delta_log/*.json
This differs from a parquet table, which does not carry any of this metadata. Irrespective of the table format (e.g. delta or parquet), you register the table in Hive, and Hive holds some metadata of its own. Since a delta table already carries a lot of metadata, what Hive stores for it differs from what it stores for a parquet table (or any other format). All of this is best explained with an example. I will use (Azure) Databricks as the engine, since everything is already set up in the runtime, but it should apply equally to OSS Spark + Delta + Hive. Azure Data Lake is the storage and the access protocol is abfss.
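To make the DeltaLog tangible before diving into the example: each commit to a delta table is a JSON-lines file under `_delta_log/`, which you can inspect with the standard library alone. A minimal sketch, assuming a hypothetical local copy of the table (on ADLS you would copy the commit file down first or read it via Spark):

```python
import json
from pathlib import Path

# Hypothetical local path to a delta table; adjust to your environment.
log_dir = Path("/tmp/tablename/_delta_log")

# Each line of a commit file is one action: protocol, metaData, add, commitInfo, ...
with open(log_dir / "00000000000000000000.json") as f:
    for line in f:
        action = json.loads(line)
        print(list(action)[0])
```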
Create database in Spark SQL

```sql
create database if not exists test_ext location 'abfss://mycontainer@mystorage.dfs.core.windows.net/test'
```
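To double-check the location from the engine side before looking at the Hive backend (this assumes the ambient `spark` session of a Databricks notebook):

```python
# The "Location" row should report the abfss path set above.
spark.sql("DESCRIBE DATABASE EXTENDED test_ext").show(truncate=False)
```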
On the Hive DB:

```sql
SELECT * FROM [dbo].[DBS]
```

| DB_ID | DESC | DB_LOCATION_URI | NAME | OWNER_NAME | OWNER_TYPE |
|-------|------|-----------------|------|------------|------------|
| 11 | | abfss://mycontainer@mystorage.dfs.core.windows.net/test | test_ext | root | USER |
Create tables in Spark SQL and insert data
We use the classic diamonds dataset, which is baked into the Databricks datasets but can also be found in public sources, e.g. GitHub/tidyverse/ggplot2/diamonds.
We create two tables, one in delta and one in parquet format:
```python
df = spark.read.format('csv').options(header='true', inferSchema='true').load('/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv')
df.write.format("delta").partitionBy("cut").saveAsTable("test_ext.diamonds_partitioned")
df.write.format("parquet").partitionBy("cut").saveAsTable("test_ext.diamonds_partitioned_parquet")
```
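As a quick sanity check, the catalog should now report the two providers (a sketch; the output layout varies by runtime):

```python
# The "Provider" row should read "delta" for the first table
# and "parquet" for the second.
for t in ["test_ext.diamonds_partitioned", "test_ext.diamonds_partitioned_parquet"]:
    spark.sql(f"DESCRIBE TABLE EXTENDED {t}").show(50, truncate=False)
```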
File system
First we look at what happened at the file system level. The database is a directory here:

`storage_account/mycontainer/test`

The tables can be found within it:

`storage_account/mycontainer/test/diamonds_partitioned`
`storage_account/mycontainer/test/diamonds_partitioned_parquet`
At the table level we find the data files plus, for the delta table, the `_delta_log` directory (note that the data files sit in subdirectories such as `cut=Ideal` due to the partitioning by column `cut`).
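A sketch of how to produce such a listing yourself; `dbutils` is the Databricks file system utility, and the path needs adjusting to your account:

```python
base = "abfss://mycontainer@mystorage.dfs.core.windows.net/test"

# For the delta table we expect the cut=* partition directories plus _delta_log;
# for the parquet table only the partition directories.
for table in ["diamonds_partitioned", "diamonds_partitioned_parquet"]:
    print(table)
    for info in dbutils.fs.ls(f"{base}/{table}"):
        print(" ", info.path)
```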
Hive metadata
Now we can look at different metadata such as columns, partitions, table parameters, etc.
On the Hive DB:

```sql
select * from dbo.[TBLS]
```

| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 1659599317 | 11 | 0 | root | 0 | 32 | diamonds_partitioned | MANAGED_TABLE | NULL | NULL | 0 |
| 33 | 1659599932 | 11 | 0 | root | 0 | 33 | diamonds_partitioned_parquet | MANAGED_TABLE | NULL | NULL | 0 |
The table parameters already show some differences. For example, there is no partition information stored for the delta table; we will find it in the `_delta_log` files instead, since that metadata lives together with the data. Conversely, delta brings metadata (the `delta.*` parameters below) which parquet does not have.
```sql
select * from dbo.[TABLE_PARAMS]
```

| TBL_ID | PARAM_KEY | PARAM_VALUE |
|---|---|---|
| 32 | delta.lastCommitTimestamp | 1659599315000 |
| 32 | delta.lastUpdateVersion | 0 |
| 32 | delta.minReaderVersion | 1 |
| 32 | delta.minWriterVersion | 2 |
| 32 | spark.sql.create.version | 3.2.1 |
| 32 | spark.sql.partitionProvider | catalog |
| 32 | spark.sql.sources.provider | delta |
| 32 | spark.sql.sources.schema | {"type":"struct","fields":[{"name":"_c0","type":"integer","nullable":true,"metadata":{}},{"name":"carat","type":"double","nullable":true,"metadata":{}},{"name":"cut","type":"string","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"clarity","type":"string","nullable":true,"metadata":{}},{"name":"depth","type":"double","nullable":true,"metadata":{}},{"name":"table","type":"double","nullable":true,"metadata":{}},{"name":"price","type":"integer","nullable":true,"metadata":{}},{"name":"x","type":"double","nullable":true,"metadata":{}},{"name":"y","type":"double","nullable":true,"metadata":{}},{"name":"z","type":"double","nullable":true,"metadata":{}}]} |
| 32 | transient_lastDdlTime | 1659599317 |
| 33 | spark.sql.create.version | 3.2.1 |
| 33 | spark.sql.partitionProvider | catalog |
| 33 | spark.sql.sources.provider | parquet |
| 33 | spark.sql.sources.schema | {"type":"struct","fields":[{"name":"_c0","type":"integer","nullable":true,"metadata":{}},{"name":"carat","type":"double","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"clarity","type":"string","nullable":true,"metadata":{}},{"name":"depth","type":"double","nullable":true,"metadata":{}},{"name":"table","type":"double","nullable":true,"metadata":{}},{"name":"price","type":"integer","nullable":true,"metadata":{}},{"name":"x","type":"double","nullable":true,"metadata":{}},{"name":"y","type":"double","nullable":true,"metadata":{}},{"name":"z","type":"double","nullable":true,"metadata":{}},{"name":"cut","type":"string","nullable":true,"metadata":{}}]} |
| 33 | spark.sql.sources.schema.numPartCols | 1 |
| 33 | spark.sql.sources.schema.partCol.0 | cut |
| 33 | transient_lastDdlTime | 1659599933 |
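The same parameters can be read back through Spark SQL without querying the metastore backend directly; for the delta table the output should include the `delta.*` keys from `TABLE_PARAMS` above (again assuming the notebook's `spark` session):

```python
# Table properties as Spark reports them for each format.
spark.sql("SHOW TBLPROPERTIES test_ext.diamonds_partitioned").show(truncate=False)
spark.sql("SHOW TBLPROPERTIES test_ext.diamonds_partitioned_parquet").show(truncate=False)
```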
When we look at the column info, we see that for the delta table Hive only stores a generic `array<string>` placeholder, while for the parquet table it stores metadata for all columns.
```sql
select * from dbo.[COLUMNS_V2]
```

| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
|---|---|---|---|---|
| 32 | from deserializer | col | array<string> | 0 |
| 33 | NULL | _c0 | int | 0 |
| 33 | NULL | carat | double | 1 |
| 33 | NULL | clarity | string | 3 |
| 33 | NULL | color | string | 2 |
| 33 | NULL | depth | double | 4 |
| 33 | NULL | price | int | 6 |
| 33 | NULL | table | double | 5 |
| 33 | NULL | x | double | 7 |
| 33 | NULL | y | double | 8 |
| 33 | NULL | z | double | 9 |
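This placeholder does not hurt in practice: when reading the delta table, Spark resolves the actual schema from the DeltaLog rather than from Hive, as a quick check shows:

```python
# Prints the full diamonds schema (_c0, carat, cut, ...), not array<string>.
spark.table("test_ext.diamonds_partitioned").printSchema()
```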
DeltaLog metadata
Finally, we can look at the metadata in the delta log itself. As you can see, there is a lot of table and column metadata, but also per-file data statistics.
First, the `0000*.json` commit file:

```json
[
{
"protocol": {
"minReaderVersion": 1,
"minWriterVersion": 2
}
},
{
"metaData": {
"id": "ef7ac39f-0977-4992-a087-a6a9f49fe650",
"format": {
"provider": "parquet",
"options": {}
},
"schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"_c0\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"carat\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"cut\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"color\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"clarity\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"depth\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"table\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"x\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"y\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"z\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
"partitionColumns": [
"cut"
],
"configuration": {},
"createdTime": 1659599309716
}
},
{
"add": {
"path": "cut=Fair/part-00000-fc4c091e-4b55-434c-92f6-6e87a18f6c76.c000.snappy.parquet",
"partitionValues": {
"cut": "Fair"
},
"size": 32547,
"modificationTime": 1659599313000,
"dataChange": true,
"stats": "{\"numRecords\":1610,\"minValues\":{\"_c0\":9,\"carat\":0.22,\"color\":\"D\",\"clarity\":\"I1\",\"depth\":43.0,\"table\":49.0,\"price\":337,\"x\":0.0,\"y\":0.0,\"z\":0.0},\"maxValues\":{\"_c0\":53883,\"carat\":5.01,\"color\":\"J\",\"clarity\":\"VVS2\",\"depth\":79.0,\"table\":95.0,\"price\":18574,\"x\":10.74,\"y\":10.54,\"z\":6.98},\"nullCount\":{\"_c0\":0,\"carat\":0,\"color\":0,\"clarity\":0,\"depth\":0,\"table\":0,\"price\":0,\"x\":0,\"y\":0,\"z\":0}}",
"tags": {
"INSERTION_TIME": "1659599313000000",
"OPTIMIZE_TARGET_SIZE": "268435456"
}
}
},
{
"add": {
"path": "cut=Good/part-00000-4ed7e5c8-7a6c-49a4-a2a8-9fdb2d81ead2.c000.snappy.parquet",
"partitionValues": {
"cut": "Good"
},
"size": 76488,
"modificationTime": 1659599313000,
"dataChange": true,
"stats": "{\"numRecords\":4906,\"minValues\":{\"_c0\":3,\"carat\":0.23,\"color\":\"D\",\"clarity\":\"I1\",\"depth\":54.3,\"table\":51.0,\"price\":327,\"x\":0.0,\"y\":0.0,\"z\":0.0},\"maxValues\":{\"_c0\":53937,\"carat\":3.01,\"color\":\"J\",\"clarity\":\"VVS2\",\"depth\":67.0,\"table\":66.0,\"price\":18788,\"x\":9.44,\"y\":9.38,\"z\":5.79},\"nullCount\":{\"_c0\":0,\"carat\":0,\"color\":0,\"clarity\":0,\"depth\":0,\"table\":0,\"price\":0,\"x\":0,\"y\":0,\"z\":0}}",
"tags": {
"INSERTION_TIME": "1659599313000001",
"OPTIMIZE_TARGET_SIZE": "268435456"
}
}
},
{
"add": {
"path": "cut=Ideal/part-00000-67ceda5d-c6c0-477f-8b27-0b1ba9688764.c000.snappy.parquet",
"partitionValues": {
"cut": "Ideal"
},
"size": 292297,
"modificationTime": 1659599314000,
"dataChange": true,
"stats": "{\"numRecords\":21551,\"minValues\":{\"_c0\":1,\"carat\":0.2,\"color\":\"D\",\"clarity\":\"I1\",\"depth\":43.0,\"table\":43.0,\"price\":326,\"x\":0.0,\"y\":0.0,\"z\":0.0},\"maxValues\":{\"_c0\":53940,\"carat\":3.5,\"color\":\"J\",\"clarity\":\"VVS2\",\"depth\":66.7,\"table\":63.0,\"price\":18806,\"x\":9.65,\"y\":31.8,\"z\":6.03},\"nullCount\":{\"_c0\":0,\"carat\":0,\"color\":0,\"clarity\":0,\"depth\":0,\"table\":0,\"price\":0,\"x\":0,\"y\":0,\"z\":0}}",
"tags": {
"INSERTION_TIME": "1659599313000002",
"OPTIMIZE_TARGET_SIZE": "268435456"
}
}
},
{
"add": {
"path": "cut=Premium/part-00000-8c05dee7-a5ce-4f29-a8b1-7bbc403f391d.c000.snappy.parquet",
"partitionValues": {
"cut": "Premium"
},
"size": 199912,
"modificationTime": 1659599314000,
"dataChange": true,
"stats": "{\"numRecords\":13791,\"minValues\":{\"_c0\":2,\"carat\":0.2,\"color\":\"D\",\"clarity\":\"I1\",\"depth\":58.0,\"table\":51.0,\"price\":326,\"x\":0.0,\"y\":0.0,\"z\":0.0},\"maxValues\":{\"_c0\":53939,\"carat\":4.01,\"color\":\"J\",\"clarity\":\"VVS2\",\"depth\":63.0,\"table\":62.0,\"price\":18823,\"x\":10.14,\"y\":58.9,\"z\":8.06},\"nullCount\":{\"_c0\":0,\"carat\":0,\"color\":0,\"clarity\":0,\"depth\":0,\"table\":0,\"price\":0,\"x\":0,\"y\":0,\"z\":0}}",
"tags": {
"INSERTION_TIME": "1659599313000003",
"OPTIMIZE_TARGET_SIZE": "268435456"
}
}
},
{
"add": {
"path": "cut=Very%20Good/part-00000-9a3e66de-4c74-4f30-af25-95fa2625e3a6.c000.snappy.parquet",
"partitionValues": {
"cut": "Very Good"
},
"size": 182719,
"modificationTime": 1659599314000,
"dataChange": true,
"stats": "{\"numRecords\":12082,\"minValues\":{\"_c0\":6,\"carat\":0.2,\"color\":\"D\",\"clarity\":\"I1\",\"depth\":56.8,\"table\":44.0,\"price\":336,\"x\":0.0,\"y\":0.0,\"z\":0.0},\"maxValues\":{\"_c0\":53938,\"carat\":4.0,\"color\":\"J\",\"clarity\":\"VVS2\",\"depth\":64.9,\"table\":66.0,\"price\":18818,\"x\":10.01,\"y\":9.94,\"z\":31.8},\"nullCount\":{\"_c0\":0,\"carat\":0,\"color\":0,\"clarity\":0,\"depth\":0,\"table\":0,\"price\":0,\"x\":0,\"y\":0,\"z\":0}}",
"tags": {
"INSERTION_TIME": "1659599313000004",
"OPTIMIZE_TARGET_SIZE": "268435456"
}
}
},
{
"commitInfo": {
"timestamp": 1659599315007,
"userId": "1234567891234567",
"userName": "user@org.com",
"operation": "CREATE TABLE AS SELECT",
"operationParameters": {
"isManaged": "true",
"description": null,
"partitionBy": "[\"cut\"]",
"properties": "{}"
},
"notebook": {
"notebookId": "1234567891234567"
},
"clusterId": "1234-123456-abcdefghi",
"isolationLevel": "WriteSerializable",
"isBlindAppend": true,
"operationMetrics": {
"numFiles": "5",
"numOutputRows": "53940",
"numOutputBytes": "783963"
},
"engineInfo": "Databricks-Runtime/10.4.x-scala2.12",
"txnId": "123abc-123abc-123abc-123abc-123abc"
}
}
]
```
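Since each action is one JSON object per line and the `stats` field is itself an embedded JSON string, aggregating the statistics takes only a few lines. A minimal sketch, assuming the commit file has been copied to the working directory:

```python
import json

# Sum numRecords over all "add" actions in the first commit.
total_rows = 0
with open("00000000000000000000.json") as f:
    for line in f:
        action = json.loads(line)
        if "add" in action:
            # "stats" is a JSON string embedded in the action, so parse it again.
            stats = json.loads(action["add"]["stats"])
            total_rows += stats["numRecords"]

print(total_rows)  # 53940, matching numOutputRows in commitInfo
```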
And then the `000*.crc` checksum file:

```json
{
"tableSizeBytes": 783963,
"numFiles": 5,
"numMetadata": 1,
"numProtocol": 1,
"protocol": {
"minReaderVersion": 1,
"minWriterVersion": 2
},
"metadata": {
"id": "ef7ac39f-0977-4992-a087-a6a9f49fe650",
"format": {
"provider": "parquet",
"options": {}
},
"schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"_c0\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"carat\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"cut\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"color\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"clarity\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"depth\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"table\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"x\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"y\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"z\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
"partitionColumns": [
"cut"
],
"configuration": {},
"createdTime": 1659599309716
},
"histogramOpt": {
"sortedBinBoundaries": [
0,
8192,
16384,
32768,
65536,
...
],
"fileCounts": [
0,
0,
1,
0,
1,
...
],
"totalBytes": [
0,
0,
32547,
0,
76488,
...
]
}
}
```
ADDITION:

A small addition regarding the file format differences in Hive metadata that I recently stumbled upon and found a bit surprising. When adding comments to a table via table properties, they are stored directly under the table ID in Hive's `dbo.TABLE_PARAMS` with `PARAM_KEY` = `comment` (assuming you follow the Hive standards for comments) for file formats like `csv` or `parquet`, but they do not show up in Hive for `delta` tables. For `delta` tables we only find the comment in the respective `_delta_log`. I find this a bit surprising, since comments are official table properties per the Hive LanguageManual DDL. I would have expected to find the comment in the delta log as well as in Hive, but presumably the Delta metadata implementation consciously decided against it. But then why store other metadata, like `spark.sql.sources.schema`, twice? It is not perfectly clear to me.

You can easily test this as shown below.
```sql
create table if not exists test.test_csv (myint int) using CSV;
create table if not exists test.test_parquet (myint int) using parquet;
create table if not exists test.test_delta (myint int) using delta;

alter table test.test_csv SET TBLPROPERTIES ('comment' = 'A table comment');
alter table test.test_parquet SET TBLPROPERTIES ('comment' = 'A table comment');
alter table test.test_delta SET TBLPROPERTIES ('comment' = 'A table comment');
```
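From the Spark side, the comment should be visible for all three tables; the asymmetry only shows up in the Hive backend:

```python
# 'comment' is returned for csv, parquet and delta alike when asked via Spark.
for t in ["test.test_csv", "test.test_parquet", "test.test_delta"]:
    spark.sql(f"SHOW TBLPROPERTIES {t} ('comment')").show(truncate=False)
```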
```sql
--In Hive:
create or alter view dbo.v_tables as
select
     TBLS.TBL_ID
    ,DBS.NAME as DB_NAME
    ,TBLS.TBL_NAME
    ,TBLS.TBL_TYPE
from TBLS
inner join DBS
    on TBLS.DB_ID = DBS.DB_ID;

select *
from dbo.v_tables tbls
left join dbo.table_params params
    on tbls.TBL_ID = params.TBL_ID
where tbls.db_name = 'test'
  and tbls.tbl_name = 'test_csv'
```
...