15

I would like to use Apache's parquet-mr project to read/write Parquet files programmatically with Java. I can't seem to find any documentation for how to use this API (aside from going through the source code and seeing how it's used) -- just wondering if any such documentation exists?

Jason Evans
  • Better go through the unit tests, I couldn't find any documents yet. :) – Devas May 03 '17 at 07:39
  • Meanwhile you can go through [this](http://stackoverflow.com/questions/42078757/is-it-possible-to-read-and-write-parquet-using-java-without-a-dependency-on-hado/42224290#42224290) as a sample. – Devas May 03 '17 at 07:45
  • Thanks @Krishas, that's a start – Jason Evans May 03 '17 at 13:30

4 Answers

10

I wrote a blog article about reading parquet files (http://www.jofre.de/?p=1459) and came up with the following solution, which is even capable of reading INT96 fields.

You need the following maven dependencies:

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>

The code is basically:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class Main {

    private static Path path = new Path("file:\\C:\\Users\\file.snappy.parquet");

    private static void printGroup(Group g) {

        int fieldCount = g.getType().getFieldCount();
        for (int field = 0; field < fieldCount; field++) {
            int valueCount = g.getFieldRepetitionCount(field);

            Type fieldType = g.getType().getType(field);
            String fieldName = fieldType.getName();

            for (int index = 0; index < valueCount; index++) {
                // note: nested (non-primitive) fields are silently skipped here
                if (fieldType.isPrimitive()) {
                    System.out.println(fieldName + " " + g.getValueToString(field, index));
                }
            }
        }

    }

    public static void main(String[] args) throws IllegalArgumentException {

        Configuration conf = new Configuration();

        try {
            ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
            MessageType schema = readFooter.getFileMetaData().getSchema();
            ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);

            PageReadStore pages = null;
            try {
                while (null != (pages = r.readNextRowGroup())) {
                    final long rows = pages.getRowCount();
                    System.out.println("Number of rows: " + rows);

                    final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                    final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                    for (int i = 0; i < rows; i++) {
                        final Group g = recordReader.read();
                        printGroup(g);

                        // TODO Compare to System.out.println(g);
                    }
                }
            } finally {
                r.close();
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }

    }
}
burubum
padmalcom
  • Where is the recursive call when you try to System out values? It seems to me that if the type is not primitive, the code does nothing about that field. – Sinan Erdem Apr 02 '19 at 12:45
  • But do you think we should use classes from org.apache.parquet.example package for reading parquet files? – Aivaras Apr 16 '20 at 12:53
6

You can find the docs at this link: https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.10.0

Use the upper-left dropdown list to navigate.
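The javadoc covers the read path shown in the other answers; for the write side, parquet-mr ships an "example" API (`org.apache.parquet.hadoop.example`). A minimal sketch, with the same 1.9.0 coordinates as the accepted answer (the class name, schema, and output path here are illustrative, not from the original post):

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        // Declare a schema with one int32 and one UTF8 string column
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; required binary name (UTF8); }");

        Path file = new Path("example.parquet"); // hypothetical output path

        ParquetWriter<Group> writer = ExampleParquetWriter.builder(file)
                .withType(schema)
                .build();

        // Build one record with the simple Group model and write it out
        SimpleGroupFactory factory = new SimpleGroupFactory(schema);
        Group group = factory.newGroup()
                .append("id", 1)
                .append("name", "parquet");
        writer.write(group);
        writer.close();
    }
}
```

The resulting file can then be read back with the `ParquetFileReader`/`GroupRecordConverter` loop from the accepted answer.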

Osama Khalifa
1

Documentation is a bit sparse and the code is somewhat tersely documented. I found ORC much easier to work with if that's an option for you.

The code snippet below converts a Parquet file to CSV with a header row using the Avro interface. It will fail if the file contains the INT96 (Hive timestamp) type (an Avro interface limitation), and decimals come out as a byte array.

Make sure you use version 1.9.0 or higher of the parquet-avro library otherwise the logging is a bit of a mess.
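For reference, the corresponding Maven dependency would look like the following (the version shown just follows the advice above; treat the exact number as a placeholder to update):

```xml
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.9.0</version>
</dependency>
```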

        // Fragment - assumes variables: Path path (input file), long lines (max rows), boolean header.
        // Requires imports: org.apache.avro.Schema, org.apache.avro.Schema.Field,
        // org.apache.avro.generic.GenericRecord, org.apache.parquet.avro.AvroParquetReader,
        // org.apache.parquet.hadoop.ParquetReader, plus java.io.*.
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(java.io.FileDescriptor.out), "ASCII"));

        ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build();

        Schema sc = null;
        List<Field> fields = null;
        for(long i = 0; i < lines; i++)  {
            GenericRecord result = reader.read();
            if(result == null)  {
                break;
            }

            if(i == 0)  {
                sc = result.getSchema();
                fields = sc.getFields();
                if(header)  {       // print header out?
                    for(int j = 0; j < fields.size(); j++)  {
                        if(j != 0)  {
                            out.write(",");
                        }
                        out.write(fields.get(j).name());
                    }
                    out.newLine();
                }
            }

            for(int j = 0; j < fields.size(); j++)  {
                if(j != 0)  {
                    out.write(",");
                }
                Object o = result.get(j);
                if(o != null)  {
                    String v = o.toString();
                    if(!v.equals("null"))  {
                        out.write("\"" + v + "\"");
                    }
                }
            }
            out.newLine();
        }
        out.flush();
        reader.close();
Mark
  • Thanks for the answer @FatFreddie, this is helpful but I'm really looking for documentation for the parquet-mr library, not parquet-avro. – Jason Evans May 04 '17 at 18:18
  • As I understand it, parquet-mr is the Java interface to Parquet. Within that you have a variety of interfaces: parquet-avro, parquet-thrift, parquet-protobuf, etc. There is also the "simple" interface used by parquet-tools (the CLI utility) - search the repo for CatCommand.java. The simple interface is easy enough to get going, but as far as I can tell it doesn't support read schemas, and I've seen comments that it was only intended as an example interface, so I stopped using it. The Avro interface does support read schemas and generally works quite well, but doesn't support INT96. – Mark May 04 '17 at 19:19
0

This is an addition to @padmalcom's answer. The code in that answer lacks the recursive handling of nested values. Instead, I return a JSONObject, and it is up to the developer how to print it. I use the function below in place of his printGroup() function. (Thanks for the original inspiration.)

// requires org.json (JSONObject, JSONException) on the classpath
private static JSONObject convertParquetGroupToJSONObject(final Group g) {
    JSONObject jsonObject = new JSONObject();

    int fieldCount = g.getType().getFieldCount();
    for (int field = 0; field < fieldCount; field++) {
        int valueCount = g.getFieldRepetitionCount(field);
        Type fieldType = g.getType().getType(field);
        String fieldName = fieldType.getName();
        for (int index = 0; index < valueCount; index++) {
            try {
                if (fieldType.isPrimitive()) {
                    // note: repeated values share one key, so later values overwrite earlier ones
                    jsonObject.put(fieldName, g.getValueToString(field, index));
                } else {
                    jsonObject.put(fieldName, convertParquetGroupToJSONObject(g.getGroup(field, index)));
                }
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
    }
    return jsonObject;
}
Sinan Erdem
  • FWIW, this doesn't handle lists or some more complex nested structures. I eventually ended up using AvroParquetReader, which handles this internally - https://github.com/benwatson528/intellij-avro-parquet-plugin/blob/master/src/main/java/uk/co/hadoopathome/intellij/viewer/fileformat/ParquetFileReader.java – Ben Watson Feb 11 '20 at 12:29