Databricks Statistics on delta table

Question

Do You think statistics on column generation have sense in delta lake ? Does it optimize joins & aggregations or maybe there counts only statistics inside _delta_log ?

enter image description here

tryied to see if statistics have influence on performance searching alot of internet resources without answer

some links from here would be useful: https://stackoverflow.com/questions/70881505/databricks-z-order-vs-partitionby/70883660#70883660 — Alex Ott, Jul 19 '23 at 11:56
@AlexOtt but does statistics on column created by analyze table... works ? do i have to use this. This articles are more about Z-ordering & partitions — Edzio Edziowski, Jul 19 '23 at 12:25
Analyze table collects more than data for data skipping: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html - this data is used then for query planning. But checking the answer may help you a bit to reason about performance — Alex Ott, Jul 19 '23 at 13:03

score 0 · Answer 1 · answered Jul 22 '23 at 16:28

0

Delta Lake metadata statistics can be used to optimize different types of queries. Here are examples of queries that can run faster by leveraging the table metadata:

select count(*) from the_table
select * from the_table limit 1

See this blog post for more information.

Column level statistics can optimize any operation whenever entire files can get skipped. Reading less data always makes queries run faster.

answered Jul 22 '23 at 16:28

Powers

18,150
10
103
108

But I have asked about ANALYZE TABLE tab_name COMPUTE STATISTICS FOR ALL COLUMNS. Not about stats in delta tables. DOes this statistics for columns helps in query times , joins or something else in delta lake ? – Edzio Edziowski Aug 03 '23 at 08:29

Databricks Statistics on delta table

1 Answers1