
I am looking at Kudu's documentation.

Below is a partial excerpt of the kudu-spark documentation:

https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster

Avoid multiple Kudu clients per cluster.

One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
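
As I understand it, the anti-pattern and the recommended fix look roughly like this (kudu.master:7051 is a placeholder address, and spark is the usual SparkSession):

import org.apache.kudu.client.KuduClient
import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

// Anti-pattern: a second client sends its own waves of
// GetTableLocations / GetTabletLocations requests to the master.
// val extra = new KuduClient.KuduClientBuilder("kudu.master:7051").build()

// Recommended: reuse the client the KuduContext already owns.
val client: KuduClient = kuduContext.syncClient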

Does this mean that I can only run one kudu-spark task at a time?

If I have a Spark Streaming program that is always writing data to Kudu, how can I connect to Kudu from other Spark programs?

xuejianbest
  • If the guidance is specifically about a single application, that is no problem. But the documentation says "Avoid multiple Kudu clients **per cluster**", so I want to confirm this. – xuejianbest Jul 01 '19 at 01:29

2 Answers


In a non-Spark program you use a KuduClient to access Kudu. In a Spark application you use a KuduContext, which already owns such a client for that Kudu cluster.

A simple Java program requires a KuduClient, built through the Java API (typically pulled in as a Maven dependency):

import org.apache.kudu.client.KuduClient;

KuduClient kuduClient = new KuduClient.KuduClientBuilder("kudu-master-hostname").build();

See http://harshj.com/writing-a-simple-kudu-java-api-program/

Many Spark/Scala programs can run at the same time against the same cluster using the Spark Kudu integration. The snippet below is borrowed from the official guide, as it has been quite some time since I looked at this.

import org.apache.kudu.client._
import org.apache.kudu.spark.kudu.KuduContext
import scala.collection.JavaConverters._

// Read a table from Kudu
val df = spark.read
              .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
              .format("kudu").load

// Query using the Spark API...
df.select("id").filter("id >= 5").show()

// ...or register a temporary view and use SQL
df.createOrReplaceTempView("kudu_table")
val filteredDF = spark.sql("SELECT id FROM kudu_table WHERE id >= 5")
filteredDF.show()

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))

// Insert data
kuduContext.insertRows(df, "test_table")

See https://kudu.apache.org/docs/developing.html
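
For the Spark Streaming part of the question: create the KuduContext once per application and reuse it for every micro-batch, rather than building a KuduClient per task. A rough Structured Streaming sketch; the rate source and the metrics table are stand-ins for your real input and table:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.kudu.spark.kudu.KuduContext

object StreamToKudu {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stream-to-kudu").getOrCreate()

    // One KuduContext (hence one KuduClient) for the whole application.
    val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

    // Stand-in source; replace with Kafka, files, etc.
    val input = spark.readStream.format("rate").load()

    input.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // The same context serves every micro-batch, so no new
        // clients hit the master as batches tick over.
        kuduContext.insertRows(batch, "metrics")
      }
      .start()
      .awaitTermination()
  }
}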

thebluephantom

A clearer statement of "avoid multiple Kudu clients per cluster" would be "avoid multiple Kudu clients per Spark application".

As the documentation says: "Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient."
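
Concretely, each Spark application creates its own KuduContext, and several applications pointed at the same master can run side by side; the rule only forbids extra clients inside one application. A minimal sketch (master address and table name are placeholders), with each object submitted as a separate Spark application:

import org.apache.spark.sql.SparkSession
import org.apache.kudu.spark.kudu.KuduContext

// Submitted as its own Spark application: the always-on writer.
object StreamingWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("writer").getOrCreate()
    val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)
    // ... write continuously via kuduContext.insertRows(...)
  }
}

// A second, independent Spark application: a batch reader.
object BatchReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("reader").getOrCreate()
    val df = spark.read
      .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
      .format("kudu").load
    df.show() // this application's client is separate, and that is fine
  }
}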

xuejianbest