
For a given data frame (df), we get the schema with df.schema, which is a StructType (an array of StructFields). Can I save just this schema to HDFS while running from spark-shell? Also, what would be the best format in which to save the schema?

Ashwin
  • [Please refer to this; I think you will find the answer](https://stackoverflow.com/questions/50816767/how-to-save-result-of-printschema-to-a-file-in-pyspark) – Prathik Kini Apr 05 '19 at 12:43

2 Answers


You can use `treeString`:

    schema = df._jdf.schema().treeString()

and convert it to an RDD and use `saveAsTextFile`:

    sc.parallelize([schema]).saveAsTextFile(...)

Or use `saveAsPickleFile`:

    temp_rdd = sc.parallelize([schema])  # wrap in a list so the string isn't split into characters
    temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
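
Note that `treeString()` only reproduces the human-readable `printSchema()` output, so the pickled value above is a plain string (readable back with `sc.pickleFile(...).first()`), not something that can be parsed back into a `StructType`. If the goal is to restore the schema later, its JSON representation round-trips cleanly. A minimal sketch for spark-shell (Scala, as in the question; the HDFS path is hypothetical):

    import org.apache.spark.sql.types.{DataType, StructType}

    // Serialize the schema to JSON (machine-readable, unlike treeString)
    val schemaJson = df.schema.json
    sc.parallelize(Seq(schemaJson)).coalesce(1).saveAsTextFile("hdfs:///tmp/df_schema_json")

    // Later: read the JSON back and rebuild the StructType
    val restored = DataType
      .fromJson(sc.textFile("hdfs:///tmp/df_schema_json").first())
      .asInstanceOf[StructType]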
Rene B.
Yes, you can save it by writing the DataFrame as parquet:

    df.write.format("parquet").save("path")  # give an HDFS path

You can also read it back from HDFS:

    sqlContext.read.parquet("path")  # give an HDFS path
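
Note that this writes the full DataFrame, not just its schema. Still, once the parquet file exists, the schema alone can be recovered from it; only the parquet metadata is inspected, not the row data. A minimal spark-shell (Scala) sketch with a hypothetical path:

    // Recover just the schema from an existing parquet file (hypothetical path);
    // only the parquet footers are read here, not the row data
    val parquetSchema = sqlContext.read.parquet("hdfs:///tmp/saved_df.parquet").schema
    println(parquetSchema.treeString)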

Parquet + compression is the best storage strategy, whether the data resides on S3 or not.

Parquet is a columnar format, so queries that touch only a subset of the columns do not have to scan the rest.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
Mohit.kc
  • Thanks, I am familiar with that approach, but it saves the complete data frame; I am interested in saving just the schema `df.schema` to HDFS. – Ashwin Jan 14 '18 at 09:22
  • I am not sure about that; I could not find any article on this either. If you find out, please let me know. – Mohit.kc Jan 14 '18 at 16:19
  • 2
    I figured a way to make it work - `val rdd = sc.parallelize(df.schema)` `rdd.coalesce(1).saveAsObjectFile("")` `val rdd2: RDD[StructField] = sc.objectFile("")` `StructType(rdd2.collect())` – Ashwin Jan 15 '18 at 04:08
  • If using Python, `saveAsPickleFile` and `pickleFile` should be used, as `saveAsObjectFile` and `objectFile` aren't available. – Gianfranco Reppucci Jul 03 '18 at 11:16
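
A fuller sketch of the object-file round trip from the comments, as it might look in spark-shell (Scala; the HDFS path is hypothetical):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types.{StructField, StructType}

    // A StructType is a Seq[StructField], so it can be parallelized directly
    val schemaRdd = sc.parallelize(df.schema)
    schemaRdd.coalesce(1).saveAsObjectFile("hdfs:///tmp/df_schema_obj")

    // Restore: read the fields back and rebuild the StructType; the
    // coalesce(1) above keeps all fields in one partition so their order survives
    val fields: RDD[StructField] = sc.objectFile[StructField]("hdfs:///tmp/df_schema_obj")
    val restored = StructType(fields.collect())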