
In PySpark you can define a schema and read data sources with this pre-defined schema, e.g.:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

Schema = StructType([ StructField("temperature", DoubleType(), True),
                      StructField("temperature_unit", StringType(), True),
                      StructField("humidity", DoubleType(), True),
                      StructField("humidity_unit", StringType(), True),
                      StructField("pressure", DoubleType(), True),
                      StructField("pressure_unit", StringType(), True)
                    ])
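
For reference, applying such a pre-defined schema when reading looks like this (the path and format are placeholders, not part of the original question):

df = spark.read.schema(Schema).csv("path/to/source.csv")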

For some data sources it is possible to infer the schema from the data and get a DataFrame with this schema definition.

Is it possible to get the schema definition (in the form described above) from a DataFrame whose schema has been inferred?

df.printSchema() prints the schema as a tree, but I need to reuse the schema, having it defined as above, so I can read a data source with a schema that was previously inferred from another data source.


5 Answers


Yes, it is possible. Use the DataFrame.schema property:

schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

>>> df.schema
StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

New in version 1.3.

The schema can also be exported to JSON and imported back if needed.
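
A minimal sketch of that JSON round trip (assuming a SparkSession spark; the path is a placeholder):

import json
from pyspark.sql.types import StructType

schema_json = df.schema.json()                           # export as a JSON string
restored = StructType.fromJson(json.loads(schema_json))  # import it back

df2 = spark.read.schema(restored).json("path/to/other/source")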


The code below gives you a well-formatted, tabular schema definition of a known DataFrame. This is quite useful when you have a very large number of columns and editing by hand is cumbersome. You can then apply it to a new DataFrame, hand-editing any columns as needed.

from pyspark.sql.types import StructType

# each element is a StructField from the existing schema
schema = [i for i in df.schema]

From there, you have your new schema:

NewSchema = StructType(schema)
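
A hedged sketch of the hand-editing step (the edited column and target type are hypothetical, as is the path):

from pyspark.sql.types import StructType, StructField, DoubleType

schema[0] = StructField(schema[0].name, DoubleType(), True)  # e.g. retype the first column
NewSchema = StructType(schema)
new_df = spark.read.schema(NewSchema).csv("path/to/new_source.csv")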

If you are looking for a DDL string from PySpark:

from pyspark.sql import DataFrame

df: DataFrame = spark.read.load('LOCATION')  # 'LOCATION' is a placeholder path
schema_json = df.schema.json()
# convert the JSON schema to a DDL string via the JVM-side DataType
ddl = spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(schema_json).toDDL()
    @user1119283: instead of df.schema.json() try with df.select('yourcolumn').schema.json() ? – anky Jun 08 '22 at 17:30
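
As a follow-up, the resulting DDL string can be passed straight back to a reader, since spark.read.schema() accepts DDL-formatted strings as well as StructType objects (the path is a placeholder):

df2 = spark.read.schema(ddl).csv("path/to/other/source.csv")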

You can reuse the schema of an existing DataFrame:

l = [('Ankita', 25, 'F'), ('Jalfaizy', 22, 'M'), ('saurabh', 20, 'M'), ('Bala', 26, None)]
people_rdd = spark.sparkContext.parallelize(l)
schemaPeople = people_rdd.toDF(['name', 'age', 'gender'])  # schema is inferred here

schemaPeople.show()

+--------+---+------+
|    name|age|gender|
+--------+---+------+
|  Ankita| 25|     F|
|Jalfaizy| 22|     M|
| saurabh| 20|     M|
|    Bala| 26|  null|
+--------+---+------+

spark.createDataFrame(people_rdd,schemaPeople.schema).show()

+--------+---+------+
|    name|age|gender|
+--------+---+------+
|  Ankita| 25|     F|
|Jalfaizy| 22|     M|
| saurabh| 20|     M|
|    Bala| 26|  null|
+--------+---+------+

Just use df.schema to get the underlying schema of the DataFrame:

schemaPeople.schema

StructType(List(StructField(name,StringType,true),StructField(age,LongType,true),StructField(gender,StringType,true)))

Since version 3.3.0, PySpark returns df.schema in a Python-style representation: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.schema.html#pyspark.sql.DataFrame.schema

>>> df.schema
StructType([StructField('age', IntegerType(), True),
            StructField('name', StringType(), True)])
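
Since this representation is valid Python, it can be pasted back into code to reconstruct the schema, given the matching imports (a minimal sketch; the read path is a placeholder):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField('age', IntegerType(), True),
                     StructField('name', StringType(), True)])
df2 = spark.read.schema(schema).json("path/to/source")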