I would like to use Apache's parquet-mr project to read/write Parquet files programmatically with Java. I can't seem to find any documentation for how to use this API (aside from going through the source code and seeing how it's used) -- just wondering if any such documentation exists?
-
Better to go through the unit tests; I couldn't find any documentation yet. :) – Devas May 03 '17 at 07:39
-
Meanwhile you can go through [this](http://stackoverflow.com/questions/42078757/is-it-possible-to-read-and-write-parquet-using-java-without-a-dependency-on-hado/42224290#42224290) as a sample. – Devas May 03 '17 at 07:45
-
Thanks @Krishas, that's a start – Jason Evans May 03 '17 at 13:30
4 Answers
I wrote a blog article about reading parquet files (http://www.jofre.de/?p=1459) and came up with the following solution, which is even capable of reading INT96 fields.
You need the following maven dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.0</version>
    </dependency>
</dependencies>
The code basically is:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class Main {
    private static Path path = new Path("file:///C:/Users/file.snappy.parquet");

    private static void printGroup(Group g) {
        int fieldCount = g.getType().getFieldCount();
        for (int field = 0; field < fieldCount; field++) {
            int valueCount = g.getFieldRepetitionCount(field);
            Type fieldType = g.getType().getType(field);
            String fieldName = fieldType.getName();
            for (int index = 0; index < valueCount; index++) {
                if (fieldType.isPrimitive()) {
                    System.out.println(fieldName + " " + g.getValueToString(field, index));
                }
            }
        }
    }

    public static void main(String[] args) throws IllegalArgumentException {
        Configuration conf = new Configuration();
        try {
            ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
            MessageType schema = readFooter.getFileMetaData().getSchema();
            ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
            PageReadStore pages = null;
            try {
                while (null != (pages = r.readNextRowGroup())) {
                    final long rows = pages.getRowCount();
                    System.out.println("Number of rows: " + rows);

                    final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                    final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                    for (int i = 0; i < rows; i++) {
                        final Group g = recordReader.read();
                        printGroup(g);
                        // TODO Compare to System.out.println(g);
                    }
                }
            } finally {
                r.close();
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }
    }
}
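Since the answer mentions INT96 (the type Hive and Impala use for timestamps): with the example API those values come back as 12-byte binaries that you have to decode yourself. A minimal pure-Java sketch of the commonly documented INT96 layout (8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day); the class name `Int96Decoder` is just for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

public class Int96Decoder {
    // Julian day number of the Unix epoch (1970-01-01).
    private static final long JULIAN_EPOCH_DAY = 2440588L;

    // Decode a 12-byte INT96 timestamp into an Instant.
    public static Instant decodeInt96(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes: nanoseconds within the day
        long julianDay = buf.getInt();   // last 4 bytes: Julian day number
        long epochDay = julianDay - JULIAN_EPOCH_DAY;
        return Instant.ofEpochSecond(epochDay * 86400L + nanosOfDay / 1_000_000_000L,
                                     nanosOfDay % 1_000_000_000L);
    }
}
```

You would call this on the raw bytes of the Binary value (`g.getInt96(field, index).getBytes()` or similar) rather than on `getValueToString`.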
-
Where is the recursive call when you try to System out values? It seems to me that if the type is not primitive, the code does nothing about that field. – Sinan Erdem Apr 02 '19 at 12:45
-
But do you think we should use classes from org.apache.parquet.example package for reading parquet files? – Aivaras Apr 16 '20 at 12:53
You can find the docs at this link: https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.10.0
Use the upper-left dropdown list to navigate.

Documentation is a bit sparse and the code is somewhat tersely documented. I found ORC much easier to work with, if that's an option for you.
The code snippet below converts a Parquet file to CSV with a header row using the Avro interface. It will fail if the file contains the INT96 (Hive timestamp) type (an Avro interface limitation), and decimals come out as byte arrays.
Make sure you use version 1.9.0 or higher of the parquet-avro library; otherwise the logging is a bit of a mess.
// Assumes path (org.apache.hadoop.fs.Path), lines (max number of rows to print)
// and header (whether to print a header row) are defined elsewhere.
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(java.io.FileDescriptor.out), "ASCII"));
ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build();
Schema sc = null;
List<Field> fields = null;
for (long i = 0; i < lines; i++) {
    GenericRecord result = reader.read();
    if (result == null) {
        break;
    }
    if (i == 0) {
        sc = result.getSchema();
        fields = sc.getFields();
        if (header) { // print header out?
            for (int j = 0; j < fields.size(); j++) {
                if (j != 0) {
                    out.write(",");
                }
                out.write(fields.get(j).name());
            }
            out.newLine();
        }
    }
    for (int j = 0; j < fields.size(); j++) {
        if (j != 0) {
            out.write(",");
        }
        Object o = result.get(j);
        if (o != null) {
            String v = o.toString();
            if (!v.equals("null")) {
                // escape embedded quotes so the CSV stays well-formed
                out.write("\"" + v.replace("\"", "\"\"") + "\"");
            }
        }
    }
    out.newLine();
}
out.flush();
reader.close();
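On the point above about decimals coming out as byte arrays: Parquet's DECIMAL logical type stores the unscaled value as big-endian two's-complement bytes, with the scale held in the column's schema annotation. A minimal sketch of turning those bytes back into a usable number with plain JDK classes (the class name `DecimalBytes` is just for illustration; you'd take the scale from the column's DECIMAL annotation):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalBytes {
    // Convert big-endian two's-complement unscaled bytes to a BigDecimal,
    // given the scale declared in the column's DECIMAL logical type.
    public static BigDecimal fromUnscaledBytes(byte[] unscaled, int scale) {
        return new BigDecimal(new BigInteger(unscaled), scale);
    }
}
```

For example, the bytes `{0x04, 0xD2}` (unscaled 1234) at scale 2 give 12.34.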

-
Thanks for the answer @FatFreddie, this is helpful but I'm really looking for documentation for the parquet-mr library, not parquet-avro. – Jason Evans May 04 '17 at 18:18
-
As I understand it, parquet-mr is the Java interface to Parquet. Within that you have a variety of interfaces: parquet-avro, parquet-thrift, parquet-protobuf, etc. There is also the "simple" interface used by parquet-tools (the CLI utility) - search the repo for CatCommand.java. The simple interface is easy enough to get going with, but as far as I can tell it doesn't support read schemas, and I've seen comments that it was only intended as an example interface, so I stopped using it. The Avro interface does support read schemas and generally works quite well, but doesn't support INT96. – Mark May 04 '17 at 19:19
This is an addition to @padmalcom's answer. The code in that answer lacks the recursive handling needed for nested values. Instead, I return a JSONObject, and it is up to the developer how to print it. I am using the function below instead of his printGroup() function. (Thanks for the original inspiration!)
private static JSONObject convertParquetGroupToJSONObject(final Group g) {
    JSONObject jsonObject = new JSONObject();
    int fieldCount = g.getType().getFieldCount();
    for (int field = 0; field < fieldCount; field++) {
        int valueCount = g.getFieldRepetitionCount(field);
        Type fieldType = g.getType().getType(field);
        String fieldName = fieldType.getName();
        for (int index = 0; index < valueCount; index++) {
            if (fieldType.isPrimitive()) {
                try {
                    jsonObject.put(fieldName, g.getValueToString(field, index));
                } catch (JSONException e) {
                    e.printStackTrace();
                }
            } else {
                try {
                    jsonObject.put(fieldName, convertParquetGroupToJSONObject(g.getGroup(field, index)));
                } catch (JSONException e) {
                    e.printStackTrace();
                }
            }
        }
    }
    return jsonObject;
}
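One caveat: when a field is repeated (valueCount > 1), each `jsonObject.put` overwrites the previous value, so only the last repetition survives. A sketch of one way around that, using plain Java collections to stand in for the JSON types (the helper name `RepeatedFieldCollector` is hypothetical; with org.json you could use `JSONObject.append` similarly):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RepeatedFieldCollector {
    // Add a value under a key; on the second occurrence, promote the entry
    // to a list so repeated Parquet fields don't silently overwrite one another.
    @SuppressWarnings("unchecked")
    public static void put(Map<String, Object> json, String key, Object value) {
        Object existing = json.get(key);
        if (existing == null) {
            json.put(key, value);
        } else if (existing instanceof List) {
            ((List<Object>) existing).add(value);
        } else {
            List<Object> values = new ArrayList<>();
            values.add(existing);
            values.add(value);
            json.put(key, values);
        }
    }
}
```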

-
FWIW, this doesn't handle lists or some more complex nested structures. I eventually ended up using AvroParquetReader, which handles this internally - https://github.com/benwatson528/intellij-avro-parquet-plugin/blob/master/src/main/java/uk/co/hadoopathome/intellij/viewer/fileformat/ParquetFileReader.java – Ben Watson Feb 11 '20 at 12:29