
I tried to load a local file as a DataFrame using spark_session and sqlContext.

df = spark_session.read...load(localpath) 

It couldn't read the local file; df was empty. But after creating an SQLContext from spark_context, it could load the local file.

sqlContext = SQLContext(spark_context)
df = sqlContext.read...load(localpath)

It worked fine, but I can't understand why. What is the cause?

Environment: Windows 10, Spark 2.2.1
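For reference, here is a fuller PySpark sketch of the two setups being compared; the CSV format, header option, and file path are illustrative assumptions, since the original calls are elided:

from pyspark.sql import SparkSession, SQLContext

localpath = "file:///C:/data/sample.csv"  # hypothetical local path

# Path 1: read through a SparkSession built directly
spark_session = SparkSession.builder.appName("local-read").getOrCreate()
df = spark_session.read.format("csv").option("header", "true").load(localpath)

# Path 2: read through an SQLContext created from the underlying SparkContext
spark_context = spark_session.sparkContext
sqlContext = SQLContext(spark_context)
df = sqlContext.read.format("csv").option("header", "true").load(localpath)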

EDIT

I've finally resolved this problem. The root cause was a version difference between the PySpark installed with pip and the Spark installed on the local file system; PySpark failed to start because py4j failed.
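If someone hits the same symptom, a quick way to check for this kind of mismatch (a rough sketch; the exact output depends on the installation) is to compare the pip package version with the version of the runtime that actually starts:

import os
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)           # version of the pip-installed pyspark package
print(os.environ.get("SPARK_HOME"))  # points at the locally installed Spark, if set

spark = SparkSession.builder.getOrCreate()
print(spark.version)                 # version of the Spark runtime that was launched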

hiropon

1 Answer


I am pasting sample code that might help. We have used this to create a SparkSession object and read a local file with it:

import org.apache.spark.sql.SparkSession

object SetTopBox_KPI1_1 {

  def main(args: Array[String]): Unit = {
    if(args.length < 2) {
      System.err.println("SetTopBox Data Analysis <Input-File> OR <Output-File> is missing")
      System.exit(1)
    }

    // Build (or reuse) a SparkSession, the single entry point in Spark 2.x
    val spark = SparkSession.builder().appName("KPI1_1").getOrCreate()

    // Read the input file as a Dataset[String], then convert it to an RDD
    val record = spark.read.textFile(args(0)).rdd

.....

On the whole, in Spark 2.2 the preferred way to use Spark is by creating a SparkSession object.
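For completeness, a rough PySpark equivalent of the snippet above (the argument handling and names are illustrative, not from the original code):

import sys
from pyspark.sql import SparkSession

if len(sys.argv) < 2:
    sys.exit("SetTopBox Data Analysis <Input-File> is missing")

spark = SparkSession.builder.appName("KPI1_1").getOrCreate()

# spark.read.text returns a DataFrame with a single `value` column;
# .rdd drops down to an RDD of Row objects, mirroring the Scala version
record = spark.read.text(sys.argv[1]).rdd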

Prashant