
I am doing an analysis in pySpark using Jupyter notebooks. My code originally built dataframes using sqlContext = SQLContext(sc), but now I've switched to HiveContext since I will be using window functions.

My problem is that now I'm getting a Java error when trying to create the dataframe:

## Create new SQL Context.
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import DataFrame
from pyspark.sql import Window
from pyspark.sql.types import *
import pyspark.sql.functions as func

sqlContext = HiveContext(sc)
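
For context, this is roughly the kind of window query I plan to run once the DataFrame exists (it relies on the Window and functions imports above; "user_id", "event_time" and the data_df reference are just placeholders):

## Example window query; column names and data_df are placeholders.
w = Window.partitionBy("user_id").orderBy("event_time")
data_df_ranked = data_df.withColumn("row_num", func.row_number().over(w))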

After creating the context, I read my data into an RDD and create the schema for my DF.
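
For reference, this is roughly how data_tmp and data_header are produced; the file path and comma delimiter below are placeholders for my actual data.

## Load the raw data and split off the header (placeholder path and delimiter).
raw = sc.textFile("/path/to/data.csv")
header_line = raw.first()
data_header = header_line.split(",")
data_tmp = raw.filter(lambda line: line != header_line).map(lambda line: line.split(","))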

## After loading the data we define the schema.
fields = [StructField(field_name, StringType(), True) for field_name in data_header]
schema = StructType(fields)

Now, when I try to build the DF this is the error I get:

## Build the DF.
data_df = sqlContext.createDataFrame(data_tmp, schema)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
/home/scala/spark-1.6.1/python/pyspark/sql/context.pyc in _get_hive_ctx(self)
    690 
    691     def _get_hive_ctx(self):
--> 692         return self._jvm.HiveContext(self._jsc.sc())
    693 
    694     def refreshTable(self, tableName):

TypeError: 'JavaPackage' object is not callable

I have been googling it without luck so far. Any advice is greatly appreciated.

masta-g3
  • It looks like you've built Spark yourself, am I right? If this is the case, can you provide some details about the method? – zero323 Jul 13 '16 at 20:28
  • @zero323 Yes, I followed a tutorial that was very similar to this one: http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/. Does it look like a config issue? I'm willing to reinstall if you have any advice on how to solve this. Thanks. – masta-g3 Jul 13 '16 at 20:57

1 Answer


HiveContext requires binaries built with Hive support, which means you have to enable the Hive profile. Since you use sbt assembly, you need at least:

sbt -Phive assembly

The same is required when building with Maven, for example:

mvn -Phive -DskipTests clean package
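
After rebuilding, a quick sanity check along these lines (my sketch, not part of the build instructions) should tell you whether Hive support made it into the binaries; the error above may only surface once the JVM HiveContext is actually created, so force it with a trivial query:

## Sanity check in pyspark: force creation of the JVM HiveContext.
from pyspark.sql import HiveContext

try:
    hive_ctx = HiveContext(sc)
    hive_ctx.range(1).collect()
    print("Hive support available")
except TypeError as e:
    print("Hive classes still missing: %s" % e)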
zero323
  • Thanks, I've tried `sbt -Phive assembly`, but now it is complaining about `Not a valid command: Phive`. Do I need to download anything or perform any other action before attempting the assembly? – masta-g3 Jul 14 '16 at 15:10
  • That doesn't sound right. Are you sure there is nothing missing there? Do you have sbt installed? If not, you can use `build/sbt`. If you cannot solve this, you can also try the deprecated: `SPARK_HIVE=true build/sbt assembly` – zero323 Jul 14 '16 at 15:20
  • I do have sbt installed; when I run the command it does start compiling, but it fails after a couple of minutes. Here is the complete output log for sbt -Phive assembly: http://pastebin.com/yMDzk5WD Do you have any suggestions? I'm kind of stuck here without access to all the HiveContext functions. – masta-g3 Jul 25 '16 at 18:41
  • Finally got it working! The issue is that I was doing the assembly with the version of sbt that I have installed on my machine, and not the one that came with Spark. So the correct way to run it was: `/home/scala/spark-1.6.1$ sudo sbt/sbt -Phive assembly`. Thanks for the help. – masta-g3 Jul 26 '16 at 15:52