
I use Apache Spark with Vertica.

Take these sample commands:

df = (spark.read.format("jdbc")
      .option("url", vertica_jdbc_url).option("dbtable", "test_table")
      .option("user", "spark_user").option("password", "password")
      .load())

result = df.filter(df.test_column == 1).count()

When I monitor the Vertica database, I see that Spark runs this query:

SELECT 1 FROM test_table WHERE ("test_column" IS NOT NULL) AND ("test_column" = 1)

If the filter matches, for example, 10 million rows, Spark fetches 10 million 1s from the database and counts them itself, which is not suitable.

How can I get the count in an optimized way, so that the database computes it?
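
One workaround that might help (a sketch, not verified against Vertica) is to put the aggregate inside the `dbtable` option as a parenthesized subquery, so that the database computes the count and Spark reads back a single row. As far as I know, Spark's JDBC source pushes filters down to the database but, at least in the 2.x releases, not aggregates, which would explain the `SELECT 1` query. The helper names `count_pushdown_table` and `pushed_down_count` and the alias `pushed_count` are made up for illustration; the connection values are the ones from the commands above.

```python
def count_pushdown_table(table: str, predicate: str) -> str:
    """Build a derived-table expression for the JDBC "dbtable" option.

    Hypothetical helper: the COUNT(*) runs inside the database, so the
    result set Spark reads has exactly one row. Both arguments are
    interpolated as-is, so they must come from trusted code, not user input.
    """
    return f"(SELECT COUNT(*) AS cnt FROM {table} WHERE {predicate}) AS pushed_count"


def pushed_down_count(spark, jdbc_url: str, table: str, predicate: str,
                      user: str, password: str) -> int:
    """Read the one-row count result through Spark's JDBC source.

    Assumes `spark` is an existing SparkSession and that the database
    accepts an aliased subquery in the FROM clause.
    """
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", count_pushdown_table(table, predicate))
          .option("user", user)
          .option("password", password)
          .load())
    # The single row holds the count in its first (and only) column.
    return int(df.first()[0])
```

It would be called as `pushed_down_count(spark, vertica_jdbc_url, "test_table", "test_column = 1", "spark_user", "password")`. The trade-off is that the filter is written in SQL rather than as a DataFrame expression.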

Mark Rotteveel
HoseinEY
  • Hi Mark, this question is a little different: here the key point is the count query, and I want to know why Apache Spark generates `SELECT 1` instead of a count query ... – HoseinEY Feb 18 '17 at 10:05
  • You did a `count(*)` in the previous question as well. To me it looks like a continuation of your previous question. If you think that is a different question, then I strongly suggest that you explicitly mention (and link) the other question, and explain why this question is different. – Mark Rotteveel Feb 18 '17 at 10:07
  • Yes, this question is based on my other question. Is your suggestion that I remove this one and continue in the original question? – HoseinEY Feb 18 '17 at 10:16
  • If it is the same question, but just other/more details, then I suggest you delete this question and edit your original. However, if this is a follow-up question (you did more investigation, and are now looking at a different aspect of the problem), then you need to edit this question and make it more obvious that this is a follow-up. People here don't really like it when they read a question and think "didn't I read this question a few days ago". Making explicit that it is a follow-up, and describing how/why it differs, goes a long way to address that problem, and reduces the chance of close votes. – Mark Rotteveel Feb 18 '17 at 10:28
