-1

Do You think statistics on column generation have sense in delta lake ? Does it optimize joins & aggregations or maybe there counts only statistics inside _delta_log ?

enter image description here

tryied to see if statistics have influence on performance searching alot of internet resources without answer

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • some links from here would be useful: https://stackoverflow.com/questions/70881505/databricks-z-order-vs-partitionby/70883660#70883660 – Alex Ott Jul 19 '23 at 11:56
  • @AlexOtt but does statistics on column created by analyze table... works ? do i have to use this. This articles are more about Z-ordering & partitions – Edzio Edziowski Jul 19 '23 at 12:25
  • Analyze table collects more than data for data skipping: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html - this data is used then for query planning. But checking the answer may help you a bit to reason about performance – Alex Ott Jul 19 '23 at 13:03

1 Answers1

0

Delta Lake metadata statistics can be used to optimize different types of queries. Here are examples of queries that can run faster by leveraging the table metadata:

  • select count(*) from the_table
  • select * from the_table limit 1

See this blog post for more information.

Column level statistics can optimize any operation whenever entire files can get skipped. Reading less data always makes queries run faster.

Powers
  • 18,150
  • 10
  • 103
  • 108
  • But I have asked about ANALYZE TABLE tab_name COMPUTE STATISTICS FOR ALL COLUMNS. Not about stats in delta tables. DOes this statistics for columns helps in query times , joins or something else in delta lake ? – Edzio Edziowski Aug 03 '23 at 08:29