
My situation:

I have a set of sources that I pass through several data layers; suppose the layers are A, B and C. Sometimes a source lands in layer A with no data, only the header (schema) of the source; in my case, all data in A is Avro. Then I have to move it from A to B; layer B can be CSV. Recently the requirements of layer B changed and now it has Parquet files too. I need the files to exist because layer C needs something to read, the header at least.

My problem:

It arises when I have to convert that header-only Avro file to a Parquet file. Is there any solution using Spark/Scala that can write only the header (schema) of an Avro, Parquet, etc. format file?

I have code that can write header-only files to CSV, just listing the columns and writing them as CSV or plain text, but when I try to write Avro or Parquet, Spark only writes the _SUCCESS flag. I have tried the different save modes and properties that I found and that Spark accepts.

For more information, I use Spark 2.3.1 and Scala 2.11.11.

H. M.
  • Did you try creating an [*empty* `DataFrame` with your specified *schema*](https://stackoverflow.com/q/31477598/3679900) (*headers*) and saving it as `Parquet` / `Avro` file? – y2k-shubham Nov 20 '18 at 08:41
  • Yes, I tried creating new empty DataFrames, reading an empty csv/avro/parquet file (only with a header), reading a file with data and deleting all rows, enabling Spark's schema inference and using a new specified schema, and when I go to write, I only get the _SUCCESS flag – H. M. Nov 20 '18 at 10:06
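
A minimal sketch of a common workaround for the behavior described in the comments (the paths and the `id`/`name` schema below are hypothetical placeholders, and it assumes the `spark-avro` package is on the classpath for the Avro part): an empty DataFrame built from an empty RDD has zero partitions, so Spark's Parquet/Avro writers emit nothing but `_SUCCESS`; forcing a single empty partition with `repartition(1)` makes the writer produce one part file that carries only the schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object HeaderOnlyWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("header-only").getOrCreate()

    // Hypothetical schema; in practice you would take it from the
    // header-only Avro file in layer A, e.g.:
    // val schema = spark.read
    //   .format("com.databricks.spark.avro").load("a/source.avro").schema
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("name", StringType)
    ))

    // An empty DataFrame from an empty RDD has zero partitions, so
    // writing it yields only the _SUCCESS marker.
    val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

    // One empty partition forces the writer to emit a single part file
    // containing just the schema -- the "header" that layer C can read.
    empty.repartition(1)
      .write.mode("overwrite")
      .parquet("b/header_only_parquet")

    // Same idea for Avro; on Spark 2.3 the Avro source is the external
    // com.databricks:spark-avro package (built-in "avro" arrived in 2.4):
    empty.repartition(1)
      .write.mode("overwrite")
      .format("com.databricks.spark.avro")
      .save("b/header_only_avro")

    spark.stop()
  }
}
```

The resulting Parquet part file has no row groups but a complete footer, so a downstream reader can still recover the schema from it.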

0 Answers