I am trying to compute statistics for a table and then view them for individual columns, but every statistic comes back as NULL for every column. I'm not sure what I am doing wrong.
ordersSchemaDDL = "orderid Int, ordertime Timestamp, custid Int, Status String"
orders_df = spark.read \
    .format("csv") \
    .option("header", True) \
    .schema(ordersSchemaDDL) \
    .option("mode", "DROPMALFORMED") \
    .option("path", "orders.csv") \
    .load()
spark.sql("create database if not exists saveAsTable")
spark.sql("ANALYZE TABLE saveAsTable.orders_bucketed COMPUTE STATISTICS;")
spark.sql("DESCRIBE EXTENDED saveAsTable.orders_bucketed orderid;").show(truncate=False)
Orders table (as you can see, it contains plenty of data):
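+-------+-------------------+------+---------------+
|orderid|ordertime          |custid|Status         |
+-------+-------------------+------+---------------+
|1      |2013-07-25 00:00:00|11599 |CLOSED         |
|2      |2013-07-25 00:00:00|256   |PENDING_PAYMENT|
|3      |2013-07-25 00:00:00|12111 |COMPLETE       |
|4      |2013-07-25 00:00:00|8827  |CLOSED         |
|5      |2013-07-25 00:00:00|11318 |COMPLETE       |
|6      |2013-07-25 00:00:00|7130  |COMPLETE       |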
Statistics Output:
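+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |orderid   |
|data_type     |int       |
|comment       |NULL      |
|min           |NULL      |
|max           |NULL      |
|num_nulls     |NULL      |
|distinct_count|NULL      |
|avg_col_len   |NULL      |
|max_col_len   |NULL      |
|histogram     |NULL      |
+--------------+----------+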
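One likely cause, offered as a sketch rather than a confirmed diagnosis: a plain `ANALYZE TABLE ... COMPUTE STATISTICS` collects only table-level statistics (row count and size), while the per-column values shown by `DESCRIBE EXTENDED <table> <column>` are filled in only after running the statement with `FOR COLUMNS ...` (or `FOR ALL COLUMNS`, available from Spark 3.0). The helper names `build_analyze_sql` and `describe_column_stats` below are hypothetical, and the sketch assumes that `saveAsTable.orders_bucketed` has already been written as a table, a step the snippet above does not show.

```python
from typing import Iterable, Optional


def build_analyze_sql(table: str, columns: Optional[Iterable[str]] = None) -> str:
    """Return an ANALYZE TABLE statement that collects column-level statistics.

    Hypothetical helper: with no column list it falls back to FOR ALL COLUMNS.
    """
    cols = list(columns) if columns is not None else []
    if cols:
        return f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {', '.join(cols)}"
    return f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR ALL COLUMNS"


def describe_column_stats(spark, table: str, column: str):
    """Collect statistics for one column, then return the DESCRIBE EXTENDED result.

    `spark` is expected to be an existing SparkSession; the table must already exist.
    """
    spark.sql(build_analyze_sql(table, [column]))
    return spark.sql(f"DESCRIBE EXTENDED {table} {column}")
```

With an active session this could be called as `describe_column_stats(spark, "saveAsTable.orders_bucketed", "orderid").show(truncate=False)`. Even then, the `histogram` row is expected to stay NULL unless `spark.sql.statistics.histogram.enabled` is set to `true` before the `ANALYZE` statement runs.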