
I am trying to create an empty DataFrame in Spark (PySpark).

I am using a similar approach to the one discussed here, but it is not working.

This is my code:

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty
user3276768

12 Answers

Extending Joe Widen's answer, you can actually create the schema with no fields like so:

schema = StructType([])

So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].

>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())

In Scala, if you choose to use sqlContext.emptyDataFrame and inspect the schema, it will return StructType().

scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()    
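
PySpark does not expose an emptyDataFrame helper (at least not in these versions), but as a minimal sketch, assuming a SparkSession named spark, you can get the same result by pairing an empty list of rows with an empty schema:

from pyspark.sql.types import StructType

# Equivalent of Scala's emptyDataFrame: no rows and no columns
empty = spark.createDataFrame([], StructType([]))
print(empty)         # DataFrame[]
print(empty.schema)  # StructType(List()) (rendering varies by version)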
Ton Torres

At the time this answer was written, it looks like you need some sort of schema:

from pyspark.sql.types import StructType, StructField, StringType

field = [StructField("field1", StringType(), True)]
schema = StructType(field)

sc = spark.sparkContext
spark.createDataFrame(sc.emptyRDD(), schema)
Joe Widen
  • Could you provide some source proving this claim? – Mateusz Dymczyk Jan 06 '16 at 03:11
  • Looks like it's not necessary, actually. Just took a look at the API information for createDataFrame and it shows the schema defaults to None, so there should be a way to create a DataFrame with no schema: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html – Joe Widen Jan 06 '16 at 16:38
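
As the traceback in the question shows, though, schema inference is exactly what breaks on empty data: with schema=None, createDataFrame samples a first row to infer column types, and rdd.first() raises ValueError on an empty RDD. A minimal sketch of the failure:

# With no schema, Spark must inspect a first row to infer column types,
# and an empty RDD has none, so this raises ValueError("RDD is empty")
sqlContext.createDataFrame(sc.emptyRDD())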

This will work with Spark version 2.0.0 or later:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
braj
  • What part of this only works for 2.0 or more? It should work in 1.6.1, right @braj259? – makansij Sep 23 '17 at 03:18
  • The Spark initialization part. From 2.0 onwards there is just one Spark context for everything, so the initialization is syntactically a little different. – braj Sep 23 '17 at 10:41
  • But if you change `sc = spark.sparkContext` to `sc = SparkContext()` then I think it should be compatible with 1.6.x, right? – makansij Sep 23 '17 at 16:04
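
For reference, a sketch of the Spark 1.6-style initialization the comments are discussing, with explicit SparkContext and SQLContext objects (my assumption, mirroring the pre-2.0 API):

# Pre-2.0 style: build the contexts explicitly instead of using a SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext(appName="empty-df")
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False),
                     StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
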
spark.range(0).drop("id")

This creates a DataFrame with an "id" column and no rows, then drops the "id" column, leaving you with a truly empty DataFrame.
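
A quick sanity check, as a sketch (assuming a SparkSession named spark), that the result has neither rows nor columns:

df = spark.range(0).drop("id")
print(df.columns)  # []
print(df.count())  # 0
df.printSchema()   # root (no fields)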

Garren S

You can just use something like this (note that this creates a DataFrame with one placeholder row rather than a truly empty one):

pivot_table = sparkSession.createDataFrame([("99", "99")], ["col1", "col2"])
morienor

If you want an empty DataFrame based on an existing one, simply limit the rows to 0. In PySpark:

emptyDf = existingDf.limit(0)
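
A minimal sketch; existingDf here is a hypothetical DataFrame standing in for your real one:

# Hypothetical source DataFrame with some data
existingDf = spark.createDataFrame([("a", 1)], ["name", "value"])

# limit(0) keeps the schema but drops every row
emptyDf = existingDf.limit(0)
emptyDf.printSchema()   # same schema as existingDf
print(emptyDf.count())  # 0
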
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkPractice').getOrCreate()

schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.printSchema()
MahakGoyal

This is a roundabout but simple way to create an empty Spark DataFrame with an inferred schema:

from pyspark.sql.functions import col

# Initialize a Spark df using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; this leaves the schema intact
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output
root
 |-- name: string (nullable = true)
 |-- index: long (nullable = true)
 |-- seq_#: long (nullable = true)
Gerard G
Seq.empty[String].toDF()

This will create an empty DataFrame with a single value column (it needs import spark.implicits._ in scope). Helpful for testing purposes. (Scala Spark)

ss301

In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Passing an empty list together with the schema works instead:

df = spark.createDataFrame([], schema)
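
For completeness, a self-contained sketch of this approach (the app name and column are just examples):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('EmptyDF').getOrCreate()
schema = StructType([StructField('name', StringType(), True)])

# No RDD involved: an empty list of rows plus an explicit schema
df = spark.createDataFrame([], schema)
df.printSchema()
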

You can do it by loading an empty file (Parquet, JSON, etc.) like this:

df = sqlContext.read.json("my_empty_file.json")

Then when you try to check the schema you'll see:

>>> df.printSchema()
root

In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Java, you can use this method to create one.

Mateusz Dymczyk

You can create an empty data frame by using the following syntax in PySpark:

df = spark.createDataFrame([], "col1 STRING, col2 STRING")

where [] is an empty list of rows. Note that a bare list of column names is not enough here, since Spark cannot infer column types from an empty dataset, so the schema is given as a DDL string with explicit types. Then you can register it as a temp view for your SQL queries:

df.createOrReplaceTempView("artist")
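
A quick usage sketch: querying the empty view returns the columns but no rows.

spark.sql("SELECT * FROM artist").show()
# +----+----+
# |col1|col2|
# +----+----+
# +----+----+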
Adrian Mole