How to create an empty DataFrame with a specified schema?

Question

I want to create a DataFrame with a specified schema in Scala. I have tried reading an empty JSON file, but I don't think that's the best practice.

zero323 · Answer 1 · 2018-05-09T12:43:10.480

Let's assume you want a data frame with the following schema:

root
 |-- k: string (nullable = true)
 |-- v: integer (nullable = false)

Simply define the schema for the data frame and pass it along with an empty RDD[Row]:

import org.apache.spark.sql.types.{
    StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row

val schema = StructType(
    StructField("k", StringType, true) ::
    StructField("v", IntegerType, false) :: Nil)

// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema) 
spark.createDataFrame(sc.emptyRDD[Row], schema)

The PySpark equivalent is almost identical:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])

# or df = sc.parallelize([]).toDF(schema)

# Spark < 2.0 
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)

Using implicit encoders (Scala only) with Product types like Tuple:

import spark.implicits._

Seq.empty[(String, Int)].toDF("k", "v")

or case class:

case class KV(k: String, v: Int)

Seq.empty[KV].toDF

or

spark.emptyDataset[KV].toDF
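
As a rough check that the encoder-based variants give the schema shown at the top of this answer, the sketch below compares the derived schema with the hand-written one. It assumes a local SparkSession (the `local[1]` master and the app name are placeholders) and is meant to be pasted into spark-shell, where `getOrCreate` reuses the existing session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: local master and app name are placeholders for a standalone run
val spark = SparkSession.builder().master("local[1]").appName("empty-df-sketch").getOrCreate()
import spark.implicits._

case class KV(k: String, v: Int)

// The schema written out by hand, as in the StructType example above
val expected = StructType(
  StructField("k", StringType, nullable = true) ::
  StructField("v", IntegerType, nullable = false) :: Nil)

val df = spark.emptyDataset[KV].toDF()

// Primitive fields such as Int are derived as non-nullable,
// reference types such as String as nullable
df.printSchema()
val sameSchema = df.schema == expected
```

If you need a nullable integer column, the case class field would have to be an Option[Int] or a boxed type instead of Int.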

This is the most appropriate answer: complete, and also useful if you want to reproduce the schema of an existing dataset quickly. I don't know why it is not the accepted one. – Lucas Lima, Jun 29 '20 at 20:25
How to create the df with a trait instead of a case class: https://stackoverflow.com/questions/64276952/encoders-productof-a-scala-trait-schema-in-spark – supernatural, Oct 09 '20 at 09:49

score 46 · Answer 2 · edited Sep 19 '17 at 10:12

As of Spark 2.0.0, you can do the following.

Case Class

Let's define a Person case class:

scala> case class Person(id: Int, name: String)
defined class Person

Import the implicit Encoders from the SparkSession:

scala> import spark.implicits._
import spark.implicits._

And use SparkSession to create an empty Dataset[Person]:

scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]

Schema DSL

You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).

scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)

scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)

scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType

scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> emptyDF.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)

Hi, the compiler says that `spark.emptyDataset` does not exist in my module. How do I use it? Is there a correct form similar to the (incorrect) `val df = apache.spark.emptyDataset[RawData]`? – Peter Krauss, Oct 16 '19 at 19:01
@PeterKrauss `spark` is the value you create using `SparkSession.builder`, not part of the `org.apache.spark` package. There are two `spark` names in use. It's the `spark` you have available in `spark-shell` out of the box. – Jacek Laskowski, Oct 16 '19 at 22:22
Thanks Jacek. I corrected it: the session created with SparkSession.builder during the first general initialization is now *passed as a parameter* (which seems the best solution), and it is running. – Peter Krauss, Oct 16 '19 at 22:32
Is there a way to create the empty dataframe using a trait instead of a case class: https://stackoverflow.com/questions/64276952/encoders-productof-a-scala-trait-schema-in-spark – supernatural, Oct 09 '20 at 09:51
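
To illustrate the point from the comments about where `spark` comes from outside spark-shell, here is a minimal sketch of a standalone application that builds its own session and passes it as a parameter. The object name, the `RawData` fields, the `local[1]` master and the app name are all hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; declared at the top level so Spark can derive an encoder
case class RawData(id: String, age: Int)

object EmptyDatasetApp {
  // Takes the session as a parameter, as described in the comments above
  def emptyRawData(spark: SparkSession): Dataset[RawData] = {
    import spark.implicits._ // brings Encoder[RawData] into scope
    spark.emptyDataset[RawData]
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name are placeholders
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-dataset-app")
      .getOrCreate()

    val ds = emptyRawData(spark)
    ds.printSchema()

    spark.stop()
  }
}
```

The `import spark.implicits._` line refers to the SparkSession instance, which is why it only compiles once a value named `spark` is in scope.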

score 5 · Answer 3 · edited Apr 18 '20 at 23:40

Java version to create an empty Dataset:

public Dataset<Row> emptyDataSet(){

    SparkSession spark = SparkSession.builder().appName("Simple Application")
                .config("spark.master", "local").getOrCreate();

    Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<>(), getSchema());

    return emptyDataSet;
}

public StructType getSchema() {

    String schemaString = "column1 column2 column3 column4 column5";

    List<StructField> fields = new ArrayList<>();

    StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
    fields.add(indexField);

    for (String fieldName : schemaString.split(" ")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }

    StructType schema = DataTypes.createStructType(fields);

    return schema;
}

score 3 · Answer 4 · edited Nov 17 '16 at 13:57

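import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derive the schema from the case class by reflection and pair it with an empty RDD[Row]
def createEmptyDataFrame[T: ru.TypeTag]: DataFrame =
  hiveContext.createDataFrame(sc.emptyRDD[Row],
    ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType])

case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]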

answered Sep 19 '16 at 10:21 by Ravindra · edited Nov 17 '16 at 13:57 by dirceusemighini
score 3 · Answer 5 · edited Oct 31 '17 at 11:29

You can define the schema with StructType in Scala and pass it together with an empty RDD to create an empty DataFrame, which can then be saved as an empty table. The following code does this.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object EmptyTable extends App {
  // Create the SparkSession with Hive support (master and app name come from spark-submit)
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

  // Schema with three columns
  val schema = StructType(
    StructField("Emp_ID", LongType, true) ::
      StructField("Emp_Name", StringType, false) ::
      StructField("Emp_Salary", LongType, false) :: Nil)

  // Empty RDD of rows
  val dataRDD = sparkSession.sparkContext.emptyRDD[Row]

  // Pass the RDD and the schema to create the DataFrame
  val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)

  newDFSchema.createOrReplaceTempView("tempSchema")

  // Materialize the empty view as a table
  sparkSession.sql("create table Finaltable AS select * from tempSchema")
}

score 3 · Answer 6 · answered Sep 10 '20 at 15:35

This is helpful for testing purposes.

Seq.empty[String].toDF()
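
For reference, a small sketch of what this one-liner produces. It assumes a local SparkSession (master and app name are placeholders), and the column name `name` in the second call is only an illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master and app name are placeholders
val spark = SparkSession.builder().master("local[1]").appName("empty-string-df").getOrCreate()
import spark.implicits._

// With no arguments, the single column gets the default name "value"
val df = Seq.empty[String].toDF()
df.printSchema()

// Passing a name to toDF overrides the default
val named = Seq.empty[String].toDF("name")
named.printSchema()
```

This only gives a single string column, so for a multi-column schema the tuple or case class variants from the other answers are a better fit.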

answered Sep 10 '20 at 15:35 by ss301

How to create an empty df from a trait instead: https://stackoverflow.com/questions/64276952/encoders-productof-a-scala-trait-schema-in-spark – supernatural, Oct 09 '20 at 09:56

score 2 · Answer 7 · answered Dec 05 '16 at 09:22

Here is a solution that creates an empty dataframe in PySpark 2.0.0 or later.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# `spark` is the SparkSession, e.g. the one provided by the pyspark shell
sc = spark.sparkContext
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
df = spark.createDataFrame(sc.emptyRDD(), schema)

score 0 · Answer 8 · answered Nov 22 '20 at 17:30

I had a special requirement: I already had a dataframe, but under a certain condition I had to return an empty one with the same schema, so I returned df.limit(0) instead.
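
A minimal sketch of that approach, assuming a local SparkSession. The sample data and the helper name `rowsOrEmpty` are hypothetical and only stand in for the existing dataframe and the condition.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: local master and app name are placeholders
val spark = SparkSession.builder().master("local[1]").appName("limit-zero-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the existing dataframe
val df = Seq(("a", 1), ("b", 2)).toDF("k", "v")

// Hypothetical helper: return the rows, or an empty frame with the same schema
def rowsOrEmpty(df: DataFrame, keep: Boolean): DataFrame =
  if (keep) df else df.limit(0)

val empty = rowsOrEmpty(df, keep = false)
empty.printSchema() // same columns and types as df
```

Because limit(0) is applied to the existing dataframe, the schema never has to be spelled out by hand.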

answered Nov 22 '20 at 17:30 by iamsmkr

score 0 · Answer 9 · answered Jun 20 '22 at 19:50

I'd like to add the following syntax which was not yet mentioned:

Seq[(String, Integer)]().toDF("k", "v")

It makes it clear that the () part is for values. It's empty, so the dataframe is empty.

This syntax is also beneficial for adding null values manually. It just works, while other options either don't or are overly verbose.
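
To make the point about null values concrete, here is a short sketch assuming a local SparkSession (master and app name are placeholders). The sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local master and app name are placeholders
val spark = SparkSession.builder().master("local[1]").appName("nullable-seq-sketch").getOrCreate()
import spark.implicits._

// java.lang.Integer (unlike Int) can hold null, so column v is nullable
val empty = Seq[(String, Integer)]().toDF("k", "v")

// Same type annotation, now with rows, one of which carries a null
val withNull = Seq[(String, Integer)](("a", 1), ("b", null)).toDF("k", "v")

empty.printSchema()
withNull.show()
```

With a plain Int in the tuple type, the `null` literal would not compile, which is why the boxed Integer is used here.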

score -3 · Answer 10 · answered Jul 17 '19 at 00:51

As of Spark 2.4.3

val df = SparkSession.builder().getOrCreate().emptyDataFrame

answered Jul 17 '19 at 00:51 by Fox Fairy

This does not solve the schema part of the question. – Andrew Sklyarevsky Aug 27 '19 at 09:44
