
Can someone explain to me the difference between spark.createDataFrame() and sqlContext.createDataFrame()? I have seen both used but do not understand the exact difference or when to use which.

Don
Ravindra Solanki

1 Answer


I'm going to assume you are using Spark 2.0 or later, because the first method appears to be called on a SparkSession, which was only introduced in Spark 2.0.

  • spark.createDataFrame(...) is the preferred way to create a DataFrame in Spark 2.x. Refer to the linked documentation for the possible usages, as it is an overloaded method.

  • sqlContext.createDataFrame(...) (Spark 1.6) was the standard way to create a DataFrame in Spark 1.x. As you can read in the linked documentation, SQLContext is superseded by SparkSession in Spark 2.x and is kept only for backward compatibility:

The entry point for working with structured data (rows and columns) in Spark 1.x.

As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.

So, to answer your question: in Spark 2.x both calls work, but the second is kept only for backward compatibility, so the first is strongly recommended. If you are stuck on Spark 1.x, the second is your only option.

Edit: SparkSession implementation (i.e. the source code) and SQLContext implementation

Rorschach
mrbolichi
  • How do they work internally? Is there any difference in the way they work? – Ravindra Solanki Jan 09 '19 at 18:30
  • The `createDataFrame` method is overloaded and has 8 different signatures. They are the same in both `SQLContext` and `SparkSession`, so you can expect the same results. Do note, however, that `SparkSession` is the preferred entry point. As far as I can tell from the source, `SQLContext` in Spark 2.x simply wraps a `SparkSession` and forwards `createDataFrame` to it, so I would not expect a meaningful performance difference either way. If you are interested, I have added links to the source code of both classes, in case you want to use the force and read the source. – mrbolichi Jan 09 '19 at 22:12
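
On the internals question, my reading of the Spark 2.x sources is that SQLContext holds a SparkSession and forwards createDataFrame to it. The sketch below is only a toy model of that delegation pattern in plain Python. `ToySession` and `ToyContext` are hypothetical stand-ins I made up for illustration, not Spark classes, and the "DataFrame" here is just a list of dicts.

```python
class ToySession:
    """Hypothetical stand-in for SparkSession (not the real class)."""

    def __init__(self):
        self.calls = 0  # counts how often the session does the real work

    def create_data_frame(self, rows, columns):
        self.calls += 1
        # Toy "DataFrame": one dict per row, keyed by column name.
        return [dict(zip(columns, row)) for row in rows]


class ToyContext:
    """Hypothetical stand-in for SQLContext: owns a session and forwards to it."""

    def __init__(self, session):
        self.session = session

    def create_data_frame(self, rows, columns):
        # No logic of its own: the call is passed straight to the session.
        return self.session.create_data_frame(rows, columns)


session = ToySession()
context = ToyContext(session)

rows, columns = [("alice", 34)], ["name", "age"]
via_session = session.create_data_frame(rows, columns)
via_context = context.create_data_frame(rows, columns)
```

Under that model both call paths end in the same method, which is why the results match and why the wrapper adds no extra work beyond one forwarded call.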