
Sorry to bother everyone. We need to write the schema of a Spark DataFrame to a separate file on S3. It should look something like this:

import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.model.PutObjectRequest

// df is the DataFrame after our other operations.
val schema = df.schema.toDDL  // schema serialized as a DDL string

val putObjectRequest = PutObjectRequest
      .builder()
      .bucket(bucket)
      .key(key)
      .build()

// s3.s3Client is our wrapper around an AWS SDK v2 S3Client.
s3.s3Client.putObject(putObjectRequest, RequestBody.fromString(schema))
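
For completeness, s3.s3Client above is just our own thin wrapper; a minimal sketch of how such a client could be built with AWS SDK v2 (the region and credentials provider here are placeholders, not our actual setup):

import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client

// Hypothetical stand-in for the s3 wrapper used above.
object s3 {
  val s3Client: S3Client = S3Client
    .builder()
    .region(Region.US_EAST_1)
    .credentialsProvider(DefaultCredentialsProvider.create())
    .build()
}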

Related: Store Schema of Read File Into csv file in spark scala

Questions:

  1. Our DataFrames mostly come from reading files with Spark's native APIs. Do we need to cache() the DataFrame before reading its schema property? From here, reading some stats requires calling cache() on the DataFrame first. We would rather avoid that and let Spark decide everything on its own. (See the sketch after this list for roughly what we do today.)
  2. If I do the above, it will only dump the schema once, not once per partition. Am I right? What I actually want is a way to pass this information to the driver and have the driver dump it only once, but I don't know which way is best. An accumulator seems like overkill.
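
For reference on question 1, this is roughly how the DataFrame is produced before the schema dump (a minimal sketch; the app name, input path, and format are placeholders, not our actual job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-dump").getOrCreate()

// Hypothetical source; our real jobs read various formats with Spark's native readers.
val df = spark.read.parquet("s3://example-bucket/input/")

// Question 1: do we need df.cache() here before touching df.schema,
// or is the schema available as driver-side metadata without materializing the data?
val ddl = df.schema.toDDL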

Thanks.
