For a given data frame (`df`) we get the schema via `df.schema`, which is a `StructType` (an array of `StructField`s). Can I save just this schema onto HDFS while running from spark-shell? Also, what would be the best format in which to save the schema?
- [Please refer to this; I think you will find the answer](https://stackoverflow.com/questions/50816767/how-to-save-result-of-printschema-to-a-file-in-pyspark) – Prathik Kini Apr 05 '19 at 12:43
2 Answers
You can use `treeString` to get the schema as a printable tree:
schema = df._jdf.schema().treeString()
then wrap it in a single-element RDD and use `saveAsTextFile`:
sc.parallelize([schema]).saveAsTextFile(...)
Or use `saveAsPickleFile`:
temp_rdd = sc.parallelize([schema])
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
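Note that `treeString` output is meant for humans and is not straightforward to parse back into a `StructType`. A round-trippable alternative, sketched in PySpark with a placeholder HDFS path, is to serialize the schema as JSON with `schema.json()` and restore it with `StructType.fromJson`:

import json
from pyspark.sql.types import StructType

# save the schema as a one-line JSON string (placeholder path)
sc.parallelize([df.schema.json()]).coalesce(1).saveAsTextFile("hdfs:///tmp/schema_json")

# read it back and rebuild the StructType
restored = StructType.fromJson(json.loads(sc.textFile("hdfs:///tmp/schema_json").first()))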

Rene B.
Yes, you can save it as parquet, which stores the schema together with the data:
df.write.format("parquet").save("path")  # give an HDFS path
You can also read it back from HDFS:
sqlContext.read.parquet("path")  # give an HDFS path
Parquet plus compression is a good storage strategy whether the data resides on S3 or not. Parquet is a columnar format, so queries perform well without scanning all columns.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
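If you only need the schema back later, parquet keeps it in the file footer, so it can be recovered without materializing any rows; a minimal sketch, assuming a placeholder HDFS path:

# the schema comes from parquet metadata; no rows are loaded here
restored_schema = sqlContext.read.parquet("hdfs:///tmp/my_table").schema
print(restored_schema)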

Mohit.kc
- Thanks, I am familiar with that approach, but it saves the complete data frame; I am interested in saving just the schema `df.schema` onto HDFS. – Ashwin Jan 14 '18 at 09:22
- I am not sure about that. I didn't find any article about this either, so if you figure it out, please tell me too. – Mohit.kc Jan 14 '18 at 16:19
- I figured out a way to make it work: `val rdd = sc.parallelize(df.schema)` `rdd.coalesce(1).saveAsObjectFile("...")` `val rdd2: RDD[StructField] = sc.objectFile("...")` `StructType(rdd2.collect())` – Ashwin Jan 15 '18 at 04:08
- If using Python, `saveAsPickleFile` and `pickleFile` should be used, as `saveAsObjectFile` and `objectFile` aren't available. – Gianfranco Reppucci Jul 03 '18 at 11:16
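Putting the last two comments together, a minimal PySpark round-trip sketch of that approach using pickle files (the HDFS path is a placeholder):

from pyspark.sql.types import StructType

# save: parallelize the list of StructFields and pickle them (placeholder path)
sc.parallelize(df.schema.fields).coalesce(1).saveAsPickleFile("hdfs:///tmp/schema.pickle")

# load: collect the fields and rebuild the StructType
schema = StructType(sc.pickleFile("hdfs:///tmp/schema.pickle").collect())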