I am using Apache Spark with Vertica. Here are my sample commands:
df = (spark.read.format("jdbc")
      .option("url", vertica_jdbc_url)
      .option("dbtable", "test_table")
      .option("user", "spark_user")
      .option("password", "password")
      .load())

result = df.filter(df.test_column == 1).count()
When I monitor the Vertica database, I can see that Spark runs this query:
SELECT 1 FROM test_table WHERE ("test_column" IS NOT NULL) AND ("test_column" = 1)
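
For reference, the same behaviour is visible from the Spark side by inspecting the physical plan of the equivalent DataFrame aggregation (groupBy().count() here is just the DataFrame form of the count() action):

# Inspect the physical plan of the equivalent aggregation.
# The filter appears under PushedFilters on the JDBC scan, while the
# aggregation itself runs in Spark, so every matching row is fetched.
df.filter(df.test_column == 1).groupBy().count().explain()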
If there are, for example, 10 million matching rows, Spark fetches 10 million 1s from the database, which is not acceptable.
How can I get the count result in an optimized way?
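
Is pushing the aggregation down to Vertica with a subquery the right approach? This is only a sketch of what I have in mind, assuming the JDBC reader accepts a parenthesised subquery as dbtable; the alias t and the column name cnt are placeholders of my own:

# Sketch: let Vertica compute the count and return a single row.
# Assumes the JDBC source accepts a subquery in "dbtable";
# the alias "t" and the column name "cnt" are my own placeholders.
count_query = "(SELECT COUNT(*) AS cnt FROM test_table WHERE test_column = 1) AS t"

count_df = (spark.read.format("jdbc")
            .option("url", vertica_jdbc_url)
            .option("dbtable", count_query)
            .option("user", "spark_user")
            .option("password", "password")
            .load())

result = count_df.collect()[0]["cnt"]  # one row comes back instead of 10 million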