I am not sure whether this is a valid question, but I would like to ask.

Is there a way to take a list of column names and generate an empty Spark DataFrame from it? The schema should be built from the elements of the list, with every column typed as StringType.

For example:

column_names = "ColA|ColB|ColC"

def Convert(string):
    # str.split already returns a list, so no extra list() call is needed
    return string.split("|")

schema_names = Convert(column_names)

#schema_names = ['ColA', 'ColB', 'ColC']

How can I use this list to create a DataFrame schema or an empty DataFrame?

This is somewhat similar to "How to create an empty DataFrame with a specified schema?", since I am also trying to create an empty DataFrame with a schema, but my approach is different: I want to generate the schema from the list.

darkmatter

3 Answers

Since you want all columns to be StringType(), define the schema as follows:

from pyspark.sql.types import StructType, StructField, StringType

column_names = "ColA|ColB|ColC"
mySchema = StructType([StructField(c, StringType()) for c in column_names.split("|")])

Now just pass in an empty list as the data along with this schema to spark.createDataFrame():

df = spark.createDataFrame(data=[], schema=mySchema)
df.show()
#+----+----+----+
#|ColA|ColB|ColC|
#+----+----+----+
#+----+----+----+

Now you can reuse this schema for other DataFrames as well.
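
As a possible alternative, createDataFrame() also accepts the schema as a DDL-formatted string, so the same list can be turned into a string like "ColA STRING, ColB STRING" instead of a StructType. The sketch below is only an illustration: build_string_ddl and empty_string_df are hypothetical helper names, not part of PySpark, and `spark` is assumed to be an existing SparkSession passed in by the caller. The backtick quoting is an assumption added so that names containing spaces or punctuation still parse.

```python
def build_string_ddl(column_names):
    """Return a DDL schema string that types every column as STRING."""
    # Backtick-quote each name (doubling any embedded backticks) so that
    # names with spaces or punctuation are still read as identifiers.
    quoted = ("`{}`".format(name.replace("`", "``")) for name in column_names)
    return ", ".join("{} STRING".format(q) for q in quoted)


def empty_string_df(spark, column_names):
    """Create an empty DataFrame; `spark` is assumed to be a SparkSession."""
    # An empty list as data plus a DDL string as schema gives zero rows
    # with the requested columns.
    return spark.createDataFrame([], schema=build_string_ddl(column_names))


ddl = build_string_ddl("ColA|ColB|ColC".split("|"))
print(ddl)  # `ColA` STRING, `ColB` STRING, `ColC` STRING
```

With a running session, empty_string_df(spark, schema_names).printSchema() should list the three columns as string. The explicit StructType above is still the better fit if some columns later need different types or nullability.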

pault

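I have a quick-and-dirty solution, probably not the best one: build a one-row DataFrame of empty strings so that Spark infers every column as a string, then filter that row out with an always-false condition.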

column_names = "ColA|ColB|ColC"

df = spark.createDataFrame(
  [
    tuple('' for i in column_names.split("|"))
  ],
  column_names.split("|")
).where("1=0")

df.show()

+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
+----+----+----+
Steven

In Scala:

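import org.apache.spark.sql.functions.lit

val columns = List("ColA", "ColB", "ColC")
// spark.emptyDataFrame has no rows, so the literal only fixes each column's type as string
val result = columns.foldLeft(spark.emptyDataFrame)((a, b) => a.withColumn(b, lit("anyStringValue")))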
result.printSchema()
result.show(false)

Output:

root
 |-- ColA: string (nullable = false)
 |-- ColB: string (nullable = false)
 |-- ColC: string (nullable = false)

+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
+----+----+----+
pasha701