
I am running a query in my Spark application that returns a substantially large amount of data. I would like to know how many rows were queried, for logging purposes. I can't seem to find a way to get the number of rows without either manually counting them or calling a method to count for me, and since the data is fairly large this gets expensive just for logging. Is there a place where the row count is saved and available to grab?

I have read here that the Python connector saves the rowcount into the object model, but I can't seem to find any equivalent for the Spark Connector or its underlying JDBC driver.

The best option I have found so far is calling `rdd.collect().size` on the RDD that Spark provides; it is about 15% faster than calling `rdd.count()`.
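To make the comparison concrete, here is a minimal sketch of the two approaches, assuming `sfOptions` holds valid connection settings and the query string stands in for my real query:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowCountForLogging {
  // Class name used by the Snowflake Spark connector
  val SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RowCountForLogging").getOrCreate()

    // Placeholder: real url/user/password/db/schema/warehouse settings go here
    val sfOptions: Map[String, String] = Map()

    val df: DataFrame = spark.read
      .format(SNOWFLAKE_SOURCE_NAME)
      .options(sfOptions)
      .option("query", "select column1, column2 from table where ...")
      .load()

    // Both of these trigger a job just to obtain the count:
    val viaCount = df.rdd.count()            // distributed count, data stays on the executors
    val viaCollect = df.rdd.collect().size   // pulls every row back to the driver first

    println(s"rows via count: $viaCount, rows via collect: $viaCollect")
  }
}
```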

Any help is appreciated

Abram
  • The DataFrame API has an option to provide a query, e.g. `sqlContext.read.format(SNOWFLAKE_SOURCE_NAME).options(sfOptions).option("query", "SELECT COUNT(1) FROM table").load()` – undefined_variable Aug 12 '19 at 16:35
  • Correct me if I am wrong, but that isn't much different from simply counting the data I already have. Also, that would require another query over the network, which takes time as well. Is there a way to get the `count(column) FROM table` within this query: `select column1, column2 from table where...`? (A sketch of this separate-count approach follows these comments.) – Abram Aug 12 '19 at 18:35
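For completeness, here is roughly what the suggestion from the first comment would look like; it reuses `spark`, `sfOptions`, and `SNOWFLAKE_SOURCE_NAME` from the sketch above, and assumes the WHERE clause matches the data query:

```scala
// Sketch of the separate-count suggestion: let Snowflake do the counting and
// ship back a single row. spark, sfOptions and SNOWFLAKE_SOURCE_NAME are assumed
// to be defined as in the sketch above.
val countDf = spark.read
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("query", "select count(1) from table where ...") // same predicate as the data query
  .load()

// Snowflake's COUNT comes back as a decimal column
val rowCount = countDf.first().getDecimal(0).longValue()
println(s"rows matching the query: $rowCount")
```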

1 Answer


The limitation is within Spark's APIs, which do not directly offer a metric for a completed distributed operation, such as a row count after a save to a table or file. Snowflake's Spark Connector is limited to the calls Apache Spark offers for its integration, and the cursor attributes otherwise available in the Snowflake Python and JDBC connectors are not accessible through Py/Spark.

The simpler form of the question, counting the rows of an executed result with the Snowflake specifics set aside, has been discussed previously with solutions: Spark: how to get the number of written rows?
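For reference, here is a minimal sketch of the listener approach described in that thread, which sums the `recordsWritten` task metric while a save runs. The listener and metric names are standard Spark APIs, but the wiring around them is illustrative only:

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

object WrittenRowsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("written-rows-sketch").getOrCreate()
    val writtenRows = new LongAdder

    // Accumulate the output-records metric of every finished task
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) writtenRows.add(metrics.outputMetrics.recordsWritten)
      }
    })

    // ... perform the write here, e.g. df.write.format(...).options(...).save() ...

    println(s"rows written so far: ${writtenRows.sum()}")
  }
}
```

Note that this reports rows written by a save operation; it still does not expose a cursor-style row count for a query that is only read.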

Harsh J