Using KuduContext in pyspark

Question

I would like to use Kudu with PySpark. I can read a table with:

sc.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"hdp1:7051").option('kudu.table',"impala::test.z_kudu_tab").load()

However, I cannot find a way to import KuduContext. I'm working in a Jupyter notebook and loading the Kudu package with:

os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 2g --packages com.ibm.spss.hive.serde2.xml:hivexmlserde:1.0.5.3 --packages org.apache.kudu:kudu-spark2_2.11:1.7.0 pyspark-shell"

My non-working code:

kudu_Context = KuduContext("es2-hdp1:7051", sc)

It fails with the error:

NameError: name 'KuduContext' is not defined

I've also tried:

kudu_context = sc._jvm.org.apache.kudu.spark.kudu.KuduContext("hdp1:7051", sc.sparkContext)

which fails with the error:

AttributeError: 'SparkContext' object has no attribute '_get_object_id'

Sorry, I'm on my phone and don't have the link. Back then I found an open Jira ticket about developing Kudu APIs for pyspark, so the answer is that it is not possible at this time (unless you use the "weird" Java wrapper from Python; that Jira ticket had a code example, IIRC). – Federico Ponzi, Aug 15 '19 at 19:15
I am guessing this is the link you are referring to - https://issues.apache.org/jira/browse/KUDU-1603 Using the weird Java wrapper, I was able to create a new Kudu context, but I ran into a lot of other weird errors. – Prasanna Saraswathi Krishnan, Aug 15 '19 at 19:40
Exactly that one. Try posting a new question with your error; maybe you will have more luck than I did back then, who knows :) – Federico Ponzi, Aug 17 '19 at 09:18
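
For readers wondering what the "Java wrapper" route from the comments might look like, here is a minimal sketch, not a confirmed fix. It assumes that `sc` in the question is really a SparkSession (it has `.read` and `.sparkContext`), that the kudu-spark package is on the driver classpath, and that the Scala `KuduContext` constructor accepts a master address plus a JVM `SparkContext`. The helper name `make_kudu_context` is made up for illustration. The idea is to hand py4j the underlying JVM object rather than the Python `SparkContext`, which is the kind of argument that triggers the `_get_object_id` error above.

```python
def make_kudu_context(spark, kudu_master):
    """Build a JVM-side KuduContext through py4j (hypothetical helper).

    Assumptions: `spark` is a SparkSession, and the kudu-spark jar was
    loaded via --packages, so the Scala class is visible to the JVM gateway.
    """
    python_sc = spark.sparkContext

    # py4j can only pass Java objects to a Java/Scala constructor, so unwrap
    # the Python SparkContext: _jsc is the JavaSparkContext, and .sc()
    # returns the underlying org.apache.spark.SparkContext.
    jvm_spark_context = python_sc._jsc.sc()

    # Reach the Scala class through the JVM gateway and call its constructor.
    kudu_package = python_sc._jvm.org.apache.kudu.spark.kudu
    return kudu_package.KuduContext(kudu_master, jvm_spark_context)
```

With a live session this would be called as `kudu_context = make_kudu_context(sc, "hdp1:7051")`. The returned object is a py4j proxy, not a Python class, so any DataFrame passed to its methods would presumably also need unwrapping to its JVM counterpart; that may be one source of the further errors mentioned in the comments.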
