23

The Parquet docs from Cloudera show examples of integration with Pig/Hive/Impala, but in many cases I want to read the Parquet file itself for debugging purposes.

Is there a straightforward Java reader API for reading a Parquet file?

Thanks, Yang

Alexander Oh
teddy teddy
  • This isn't a direct answer, but you may have some luck by going through the parquet-tools project that exposes a command line tool to read Parquet files and seeing what you can call from your own Java application. https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools – Jeremy Beard Feb 19 '15 at 19:56
  • related: http://stackoverflow.com/questions/30565510/how-to-read-and-write-mapstring-object-from-to-parquet-file-in-java-or-scala – okigan Jun 01 '15 at 05:05
  • @JeremyBeard That repo is empty as of 1/17 – WestCoastProjects Jan 12 '17 at 03:28
  • Possible duplicate of [How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 \[Windows\]](https://stackoverflow.com/questions/47355038/how-to-generate-parquet-file-using-pure-java-including-date-decimal-types-an) – Sal Jun 24 '18 at 13:45

2 Answers

11

Old method (deprecated):

    AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
    GenericRecord nextRecord = reader.read();

New method:

    ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
    GenericRecord nextRecord = reader.read();

I got this from here and have used this in my test cases successfully.
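
For debugging you usually want more than the first record, so here is a rough sketch of how the builder call can be wrapped in a loop. It assumes `parquet-avro` and the Hadoop client libraries are on the classpath. The class name `ParquetDump`, the helper `readFirst`, and the `limit` parameter are made up for illustration. `read()` returns `null` once the file is exhausted, which is what ends the loop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// Hypothetical helper class, not part of parquet-avro.
public class ParquetDump {

    // Returns at most `limit` records from the Parquet file at `location`.
    static List<GenericRecord> readFirst(String location, int limit) throws IOException {
        List<GenericRecord> records = new ArrayList<>();
        // Same builder call as above; try-with-resources closes the reader.
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(location)).build()) {
            GenericRecord record;
            // read() yields null when there are no more records.
            while (records.size() < limit && (record = reader.read()) != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Print the first 20 records of the file given on the command line.
        for (GenericRecord record : readFirst(args[0], 20)) {
            System.out.println(record);
        }
    }
}
```

Capping the number of records keeps memory use bounded on large files. Raise the limit or print inside the loop if you need to scan the whole file.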

zhongjiajie
rishiehari
  • This doesn't work for me - I only get Caused by: java.lang.ClassCastException: cannot be cast to org.apache.avro.generic.IndexedRecord – Magnus May 28 '20 at 09:53
8

You can use AvroParquetReader from the parquet-avro library to read a Parquet file as a sequence of Avro GenericRecord objects.

kostya