
For a given data frame (df), we get the schema with df.schema, which is a StructType (an array of StructFields). Can I save just this schema to HDFS while running from spark-shell? Also, what would be the best format in which to save the schema?

Ashwin
  • [Please refer to this; I think you will find the answer](https://stackoverflow.com/questions/50816767/how-to-save-result-of-printschema-to-a-file-in-pyspark) – Prathik Kini Apr 05 '19 at 12:43

2 Answers


You can use `treeString`:

    schema = df._jdf.schema().treeString()

and convert it to an RDD and use `saveAsTextFile`:

    sc.parallelize([schema]).saveAsTextFile(...)

Or use `saveAsPickleFile`:

    temp_rdd = sc.parallelize([schema])  # wrap in a list so the string isn't split into characters
    temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
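
Note that `treeString()` only reproduces the human-readable `printSchema()` output, so the pickled value above is a plain string (readable back with `sc.pickleFile(...).first()`), not something that can be parsed back into a `StructType`. If the goal is to restore the schema later, its JSON representation round-trips cleanly. A minimal sketch for spark-shell (Scala, as in the question; the HDFS path is hypothetical):

    import org.apache.spark.sql.types.{DataType, StructType}

    // Serialize the schema to JSON (machine-readable, unlike treeString)
    val schemaJson = df.schema.json
    sc.parallelize(Seq(schemaJson)).coalesce(1).saveAsTextFile("hdfs:///tmp/df_schema_json")

    // Later: read the JSON back and rebuild the StructType
    val restored = DataType
      .fromJson(sc.textFile("hdfs:///tmp/df_schema_json").first())
      .asInstanceOf[StructType]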
Rene B.
Yes, you can save it by writing the DataFrame as parquet:

    df.write.format("parquet").save("path")  # give an HDFS path

You can also read it back from HDFS:

    sqlContext.read.parquet("path")  # give an HDFS path
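
Note that this writes the full DataFrame, not just its schema. Still, once the parquet file exists, the schema alone can be recovered from it; only the parquet metadata is inspected, not the row data. A minimal spark-shell (Scala) sketch with a hypothetical path:

    // Recover just the schema from an existing parquet file (hypothetical path);
    // only the parquet footers are read here, not the row data
    val parquetSchema = sqlContext.read.parquet("hdfs:///tmp/saved_df.parquet").schema
    println(parquetSchema.treeString)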

Parquet + compression is the best storage strategy, whether the data resides on S3 or not.

Parquet is a columnar format, so queries that touch only a subset of the columns do not have to scan the rest.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
Mohit.kc
  • Thanks, I am familiar with that approach, but it saves the complete data frame; I am interested in saving just the schema `df.schema` to HDFS. – Ashwin Jan 14 '18 at 09:22
  • I am not sure about that; I could not find any article on this either. If you find out, please let me know. – Mohit.kc Jan 14 '18 at 16:19
  • 2
    I figured a way to make it work - `val rdd = sc.parallelize(df.schema)` `rdd.coalesce(1).saveAsObjectFile("")` `val rdd2: RDD[StructField] = sc.objectFile("")` `StructType(rdd2.collect())` – Ashwin Jan 15 '18 at 04:08
  • If using Python, `saveAsPickleFile` and `pickleFile` should be used, as `saveAsObjectFile` and `objectFile` aren't available. – Gianfranco Reppucci Jul 03 '18 at 11:16
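
A fuller sketch of the object-file round trip from the comments, as it might look in spark-shell (Scala; the HDFS path is hypothetical):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types.{StructField, StructType}

    // A StructType is a Seq[StructField], so it can be parallelized directly
    val schemaRdd = sc.parallelize(df.schema)
    schemaRdd.coalesce(1).saveAsObjectFile("hdfs:///tmp/df_schema_obj")

    // Restore: read the fields back and rebuild the StructType; the
    // coalesce(1) above keeps all fields in one partition so their order survives
    val fields: RDD[StructField] = sc.objectFile[StructField]("hdfs:///tmp/df_schema_obj")
    val restored = StructType(fields.collect())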