
I'm trying to write a Dataset object as a Parquet file using Java.

I followed this example to do so, but it is absurdly slow.

It takes ~1.5 minutes to write ~10 MB of data, so it isn't going to scale well when I want to write hundreds of MB. I did some CPU profiling and found that 99% of the time was spent in the ParquetWriter.write() method.

I tried increasing the page size and block size of the ParquetWriter, but neither seems to have any effect on performance. Is there any way to make this process faster, or is it just a limitation of the Parquet library?
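
For reference, this is a stripped-down sketch of the kind of tuning I mean, using the bundled ExampleParquetWriter builder; the schema, output path, and size values are placeholders for illustration, not my actual code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriterConfigSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema, not my real one.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int64 id; required binary name (UTF8); }");

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/out.parquet"))   // placeholder output path
                .withConf(new Configuration())
                .withType(schema)
                .withRowGroupSize(128 * 1024 * 1024)     // "block size" = row group size
                .withPageSize(1024 * 1024)               // page size
                .build()) {
            // rows would be written here, one writer.write(group) call per row
        }
    }
}
```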

  • Have you checked this: https://stackoverflow.com/questions/51328393/how-to-read-and-write-parquet-files-efficiently? – Sagar Gangwal Aug 19 '20 at 17:02
  • Is the 10MB the size of the resulting parquet file or the input data? Also, that example seems to first prepare all data as `List`, are you sure the actual writing is slow or the creation of those objects? – Jörn Horstmann Aug 20 '20 at 11:34
  • @JörnHorstmann The ~10MB is the resulting parquet file. Serializing the `Dataset` object as a CSV results in a ~1MB file. I did some profiling with Intellij and the issue is for sure in the `.write()` method and not in the creation of the objects – rewong03 Aug 20 '20 at 14:39

1 Answer


I've had reasonable luck using org.apache.parquet.hadoop.ParquetWriter to write org.apache.parquet.example.data.Group instances created by org.apache.parquet.example.data.simple.SimpleGroupFactory, as in this data generator from the Parquet benchmarks:

https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java
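
If it helps, here is a minimal, self-contained sketch along those lines; the schema, output path, row count, and codec are made up for illustration and should be swapped for whatever your Dataset actually contains:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GroupWriterSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column schema; substitute the schema of your data.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int64 id; required binary name (UTF8); }");

        SimpleGroupFactory factory = new SimpleGroupFactory(schema);

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/example.parquet"))   // placeholder output path
                .withConf(new Configuration())
                .withType(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            // Build one Group per row and hand it to the writer.
            for (long i = 0; i < 1_000_000L; i++) {
                Group row = factory.newGroup()
                        .append("id", i)
                        .append("name", "row-" + i);
                writer.write(row);
            }
        }
    }
}
```

SNAPPY is used here only as a common default; the compression codec is one more knob worth experimenting with when chasing throughput.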

I'd love to know of a faster way (more columns x rows per second per thread).
