15

I'm trying to read a local Parquet file; however, the only APIs I can find are tightly coupled with Hadoop and require a Hadoop Path as input (even just to point at a local file).

ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file).build();
GenericRecord nextRecord = reader.read();

is the most popular answer in "how to read a parquet file, in a standalone java code?", but it requires a Hadoop Path and has now been deprecated in favour of a mysterious InputFile instead. The only implementation of InputFile I can find is HadoopInputFile, so again no help.

In Avro this is as simple as:

DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
this.dataFileReader = new DataFileReader<>(file, datumReader);

(where file is java.io.File). What's the Parquet equivalent?

I am specifically asking for answers with no Hadoop Path dependency, because Hadoop drags in bloat and jar hell, and it seems silly to require it just to read local files.

To further explain the backstory, I maintain a small IntelliJ plugin that allows users to drag-and-drop Avro files into a pane for viewing in a table. This plugin is currently 5MB. If I include Parquet and Hadoop dependencies, it bloats to over 50MB, and doesn't even work.


POST-ANSWER ADDENDUM

Now that I have it working (thanks to the accepted answer), here is my working solution that avoids all the annoying errors that can be dragged in by depending heavily on the Hadoop Path API:
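
(The sketch below shows the shape of that solution rather than a verbatim copy of the plugin code. The LocalInputFile class is modeled on the smile implementation referenced in the accepted answer; the class layout and the "example.parquet" path are illustrative assumptions, and it assumes a recent parquet-avro (1.11+) that has the InputFile-based builder.)

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch, not the plugin's exact code: an InputFile backed by java.nio,
// so no org.apache.hadoop.fs.Path (and no LocalFileSystem lookup) is needed.
public class LocalInputFile implements InputFile {

    private final java.nio.file.Path path;

    public LocalInputFile(java.nio.file.Path path) {
        this.path = path;
    }

    @Override
    public long getLength() throws IOException {
        return Files.size(path);
    }

    @Override
    public SeekableInputStream newStream() throws IOException {
        final SeekableByteChannel channel = Files.newByteChannel(path);
        final InputStream in = Channels.newInputStream(channel);
        return new DelegatingSeekableInputStream(in) {
            @Override
            public long getPos() throws IOException {
                return channel.position();
            }

            @Override
            public void seek(long newPos) throws IOException {
                channel.position(newPos);
            }
        };
    }

    // Example usage: "example.parquet" is a placeholder path.
    public static void main(String[] args) throws IOException {
        InputFile inputFile = new LocalInputFile(Paths.get("example.parquet"));
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(inputFile).build()) {
            for (GenericRecord record = reader.read(); record != null; record = reader.read()) {
                System.out.println(record);
            }
        }
    }
}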

Ben Watson

5 Answers

13

Unfortunately the Java Parquet implementation is not independent of some Hadoop libraries. There is an existing issue in their bug tracker to make it easy to read and write Parquet files in Java without depending on Hadoop, but there does not seem to be much progress on it. The InputFile interface was added to provide a bit of decoupling, but a lot of the classes that implement the metadata part of Parquet, and also all of the compression codecs, live inside the Hadoop dependency.

I found another implementation of InputFile in the smile library; this might be more efficient than going through the Hadoop filesystem abstraction, but it does not solve the dependency problem.

As other answers already mention, you can create a Hadoop Path for a local file and use that without problems.

java.io.File file = ...
new org.apache.hadoop.fs.Path(file.toURI())
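
For example (a sketch, not part of the original answer), that Path can be wrapped in a HadoopInputFile and passed to the non-deprecated, InputFile-based builder; "data.parquet" is a placeholder file name:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.File;

// Sketch: point a Hadoop Path at a local file, wrap it in a HadoopInputFile,
// and read Avro GenericRecords from it.
File file = new File("data.parquet");
Path path = new Path(file.toURI());
Configuration conf = new Configuration();

try (ParquetReader<GenericRecord> reader =
         AvroParquetReader.<GenericRecord>builder(HadoopInputFile.fromPath(path, conf)).build()) {
    for (GenericRecord record = reader.read(); record != null; record = reader.read()) {
        System.out.println(record);
    }
}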

The dependency tree that is pulled in by hadoop can be reduced a lot by defining some exclusions. I'm using the following to reduce the bloat (Gradle syntax):

compile("org.apache.hadoop:hadoop-common:3.1.0") {
    exclude(group: 'org.slf4j')
    exclude(group: 'org.mortbay.jetty')
    exclude(group: 'javax.servlet.jsp')
    exclude(group: 'com.sun.jersey')
    exclude(group: 'log4j')
    exclude(group: 'org.apache.curator')
    exclude(group: 'org.apache.zookeeper')
    exclude(group: 'org.apache.kerby')
    exclude(group: 'com.google.protobuf')
}
Jörn Horstmann
  • Thanks, this is exactly why I added a bounty - this is a great answer. Detailed, with sources and explanations, and then an answer that works for me. `LocalInputFile` avoids `Path` and so doesn't trigger the `ClassNotFoundException: Class org.apache.hadoop.fs.LocalFileSystem` errors I've been getting in my IntelliJ plugin. – Ben Watson Feb 05 '20 at 08:14
  • The link to InputFile class implementation is currently broken. Here is an alternative permanent link: https://github.com/haifengl/smile/blob/6a3047c31040f20117c7c67c063975aeb29beafb/base/src/main/java/smile/io/LocalInputFile.java – aeciosan May 03 '22 at 22:35
2

The parquet-tools utility seems like a good place to start. It does have some Hadoop dependencies, but it works just as well with local files as with HDFS (depending on defaultFS in the Configuration). If you have licensing restrictions (the tools are Apache V2, like everything else), you can probably just review the source of one of the content-printing commands (cat, head, or dump) for inspiration.

The closest thing to your Avro example would be using ParquetFileReader, I guess.

  Configuration conf = new Configuration();
  Path path = new Path("/parquet/file/path");
  ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
  ParquetFileReader reader = new ParquetFileReader(conf, path, footer);
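
If you then want to actually print the records, a common pattern (a sketch continuing the idea above, not part of the original answer; the helper method name is made up) is to walk the row groups with the simple example Group API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

import java.io.IOException;

// Sketch: dump every record in the file using the "example" Group API (method name is hypothetical).
static void printAllRecords(Configuration conf, Path path) throws IOException {
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = footer.getFileMetaData().getSchema();
    try (ParquetFileReader reader = new ParquetFileReader(conf, path, footer)) {
        PageReadStore rowGroup;
        while ((rowGroup = reader.readNextRowGroup()) != null) {
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            RecordReader<Group> records =
                    columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
            for (long i = 0, rows = rowGroup.getRowCount(); i < rows; i++) {
                System.out.println(records.read());
            }
        }
    }
}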
mazaneicha
  • Thanks, I haven't found anything better than this yet either. – Ben Watson Jan 28 '20 at 12:41
  • It's worth noting that this API is deprecated. There are several different ways to read a Parquet file, none of which are documented, and all of which require an array of Hadoop dependencies. The search goes on... – Ben Watson Feb 01 '20 at 22:17
  • Deprecated in what version? Have you searched github for the parquet source code? @Ben – OneCricketeer Feb 02 '20 at 17:50
  • It's deprecated in the version OP linked to, which itself is fairly old (`1.8.3` vs the latest `1.11.0`). The new recommended way is to use a constructor that takes an `InputFile` - so you might think "great, I bet there's a `LocalInputFile`", but nope, I can hardly find any subclasses or usages anywhere, and the only one that worked for me was `HadoopInputFile`. – Ben Watson Feb 02 '20 at 20:24
2

Here is a complete sample application, also using the LocalInputFile.java class that is part of the solution above, to read a Parquet file with minimal dependencies:

https://github.com/GeoscienceAustralia/wit_tooling/tree/main/examples/java/parquet-reader

In contrast to the other example solutions, this project also avoids the Avro dependency.

Martin
1

If you really cannot avoid Hadoop any other way, you can try Spark and run it in local mode. A quick start guide can be found here: https://spark.apache.org/docs/latest/index.html. Downloads are available here: https://archive.apache.org/dist/spark/ (pick a version you like; there is always a build without Hadoop, although unfortunately the compressed package is still around 10-15MB). You will also be able to find some Java examples under examples/src/main.

After that, you can read the file in as a Spark DataFrame like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("Reducing dependency by adding more dependencies")
        .master("local[*]")
        .getOrCreate();
Dataset<Row> parquet = spark.read().parquet("C:/files/myfile.csv.parquet");
parquet.show(20);

This solution does satisfy the original conditions in the question. However, it doesn't change the fact that it's a rather roundabout approach (but hell yeah, it's funny). Still, it might help to open up a new possible way to tackle this.

Long Vu
  • Nice answer thanks, that's definitely a novel way to solve the problem. I had actually given this a quick try; after depending on `spark-sql` and some Parquet libraries my application was up to about 93MB (I hadn't tried excluding things or building Spark myself). Spark definitely seems like the preferred solution for dealing with Parquet files in general, although as you say, not for a lightweight solution just for reading local files. Spark also drags in its own host of jar hell issues, which is one of the reasons I was looking to avoid Hadoop. – Ben Watson Feb 05 '20 at 08:29
0

You can use the ParquetFileReader class for that:

dependencies {
    compile group: 'org.apache.hadoop', name: 'hadoop-common', version: '3.2.0'
    compile group: 'org.apache.parquet', name: 'parquet-hadoop', version: '1.10.1'
}

You can specify your local file path here:

Configuration conf = new Configuration();
Path path = new Path("file:///C:/myfile.snappy.parquet");
ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
ParquetFileReader r = new ParquetFileReader(conf, path, footer);
UDIT JOSHI
  • OK but this does use both a Hadoop dependency and a Hadoop `Path` object, which is what I'm looking to avoid. Maybe there's some ambiguity in my question so I have gone through and tried to make it clearer. – Ben Watson Feb 04 '20 at 08:51
  • It's working fine without a Hadoop Path; you can use a local path. There is also a constructor available that takes an InputFile: `ParquetFileReader r = new ParquetFileReader(inputFile, options)`. – UDIT JOSHI Feb 04 '20 at 09:16
  • When I say "Hadoop Path" I mean any API that uses `org.apache.hadoop.fs.Path`. I also mention `InputFile` in my question, and have only been able to find a `HadoopInputFile` implementation; if you find another that doesn't depend on Hadoop then I will accept that as an answer. – Ben Watson Feb 04 '20 at 09:20
  • `java.io.File` isn't a subclass of `org.apache.parquet.io.InputFile`, so that doesn't work, unless I'm really missing something here. – Ben Watson Feb 04 '20 at 09:44
  • Please can you link me to the constructor `new ParquetFileReader(new File("C:\\myfile.snappy.parquet"), options)`? – Ben Watson Feb 04 '20 at 10:00
  • https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.1/org/apache/parquet/hadoop/ParquetFileReader.html – UDIT JOSHI Feb 04 '20 at 10:12
  • Again, that is an `org.apache.parquet.io.InputFile`; `java.io.File` does not implement this interface. Your example does not compile. You cannot pass a `java.io.File` into that constructor. – Ben Watson Feb 04 '20 at 10:15