
I am trying to create an empty DataFrame in Spark (PySpark).

I am using a similar approach to the one discussed here, but it is not working.

This is my code:

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty
user3276768

12 Answers

Extending Joe Widen's answer, you can actually create the schema with no fields like so:

schema = StructType([])

So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].

>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())

In Scala, if you choose to use sqlContext.emptyDataFrame and inspect the schema, it will return StructType().

scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()    
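
PySpark does not expose an emptyDataFrame helper (at least not in these versions), but as a minimal sketch, assuming a SparkSession named spark, you can get the same result by pairing an empty list of rows with an empty schema:

from pyspark.sql.types import StructType

# Equivalent of Scala's emptyDataFrame: no rows and no columns
empty = spark.createDataFrame([], StructType([]))
print(empty)         # DataFrame[]
print(empty.schema)  # StructType(List()) (rendering varies by version)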
Ton Torres

At the time this answer was written, it looks like you need some sort of schema:

from pyspark.sql.types import StructType, StructField, StringType

field = [StructField("field1", StringType(), True)]
schema = StructType(field)

sc = spark.sparkContext
spark.createDataFrame(sc.emptyRDD(), schema)
Joe Widen
  • Could you provide some source proving this claim? – Mateusz Dymczyk Jan 06 '16 at 03:11
  • Looks like it's not necessary, actually. Just took a look at the API information for createDataFrame and it shows the schema defaults to None, so there should be a way to create a DataFrame with no schema: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html – Joe Widen Jan 06 '16 at 16:38
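
As the traceback in the question shows, though, schema inference is exactly what breaks on empty data: with schema=None, createDataFrame samples a first row to infer column types, and rdd.first() raises ValueError on an empty RDD. A minimal sketch of the failure:

# With no schema, Spark must inspect a first row to infer column types,
# and an empty RDD has none, so this raises ValueError("RDD is empty")
sqlContext.createDataFrame(sc.emptyRDD())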

This will work with Spark version 2.0.0 or later:

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
braj
  • What part of this only works for 2.0 or more? It should work in 1.6.1, right @braj259? – makansij Sep 23 '17 at 03:18
  • The Spark initialization part. From 2.0 onwards there is just one Spark context for everything, so the initialization is syntactically a little different. – braj Sep 23 '17 at 10:41
  • But if you change `sc = spark.sparkContext` to `sc = SparkContext()` then I think it should be compatible with 1.6.x, right? – makansij Sep 23 '17 at 16:04
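
For reference, a sketch of the Spark 1.6-style initialization the comments are discussing, with explicit SparkContext and SQLContext objects (my assumption, mirroring the pre-2.0 API):

# Pre-2.0 style: build the contexts explicitly instead of using a SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = SparkContext(appName="empty-df")
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False),
                     StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
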
spark.range(0).drop("id")

This creates a DataFrame with an "id" column and no rows, then drops the "id" column, leaving you with a truly empty DataFrame.
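
A quick sanity check, as a sketch (assuming a SparkSession named spark), that the result has neither rows nor columns:

df = spark.range(0).drop("id")
print(df.columns)  # []
print(df.count())  # 0
df.printSchema()   # root (no fields)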

Garren S

You can just use something like this (note that this creates a DataFrame with one placeholder row rather than a truly empty one):

pivot_table = sparkSession.createDataFrame([("99", "99")], ["col1", "col2"])
morienor

If you want an empty DataFrame based on an existing one, simply limit the rows to 0. In PySpark:

emptyDf = existingDf.limit(0)
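
A minimal sketch; existingDf here is a hypothetical DataFrame standing in for your real one:

# Hypothetical source DataFrame with some data
existingDf = spark.createDataFrame([("a", 1)], ["name", "value"])

# limit(0) keeps the schema but drops every row
emptyDf = existingDf.limit(0)
emptyDf.printSchema()   # same schema as existingDf
print(emptyDf.count())  # 0
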
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType

spark = SparkSession.builder.appName('SparkPractice').getOrCreate()

schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
df.printSchema()
MahakGoyal

This is a roundabout but simple way to create an empty Spark DataFrame with an inferred schema:

from pyspark.sql.functions import col

# Initialize a Spark df using one row of data with the desired schema
init_sdf = spark.createDataFrame([('a_string', 0, 0)], ['name', 'index', 'seq_#'])
# Remove the row; this leaves the schema intact
empty_sdf = init_sdf.where(col('name') == 'not_match')
empty_sdf.printSchema()
# Output
root
 |-- name: string (nullable = true)
 |-- index: long (nullable = true)
 |-- seq_#: long (nullable = true)
Gerard G
Seq.empty[String].toDF()

This will create an empty DataFrame with a single value column (it needs import spark.implicits._ in scope). Helpful for testing purposes. (Scala Spark)

ss301

In Spark 3.1.2, the spark.sparkContext.emptyRDD() function throws an error. Passing an empty list together with the schema works instead:

df = spark.createDataFrame([], schema)
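
For completeness, a self-contained sketch of this approach (the app name and column are just examples):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('EmptyDF').getOrCreate()
schema = StructType([StructField('name', StringType(), True)])

# No RDD involved: an empty list of rows plus an explicit schema
df = spark.createDataFrame([], schema)
df.printSchema()
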

You can do it by loading an empty file (Parquet, JSON, etc.) like this:

df = sqlContext.read.json("my_empty_file.json")

Then when you try to check the schema you'll see:

>>> df.printSchema()
root

In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Java, you can use this method to create one.

Mateusz Dymczyk

You can create an empty data frame by using the following syntax in PySpark:

df = spark.createDataFrame([], "col1 STRING, col2 STRING")

where [] is an empty list of rows. Note that a bare list of column names is not enough here, since Spark cannot infer column types from an empty dataset, so the schema is given as a DDL string with explicit types. Then you can register it as a temp view for your SQL queries:

df.createOrReplaceTempView("artist")
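
A quick usage sketch: querying the empty view returns the columns but no rows.

spark.sql("SELECT * FROM artist").show()
# +----+----+
# |col1|col2|
# +----+----+
# +----+----+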
Adrian Mole