13

I am trying to install parquet tools on a FreeBSD machine.

I cloned this repo: git clone https://github.com/apache/parquet-mr

Then I did cd parquet-mr/parquet-tools

Then I did `mvn clean package -Plocal`

As specified here: https://github.com/apache/parquet-mr/tree/master/parquet-tools

This is what I got:

[Screenshot of the Maven dependency error; the image is not available]

Why is this dependency error here? How do I get around it?

user3685285
  • The error seems to be fairly self-explanatory; the artifact you're looking for isn't in Jitpack's repository. Is it in Maven Central? – Makoto Nov 14 '18 at 18:08
  • Ah, turns out I just needed to checkout the latest stable release tag, not the master branch. – user3685285 Nov 14 '18 at 18:33
  • 3
    [**Do not post images of code or errors!**](https://meta.stackoverflow.com/q/303812/995714) Images and screenshots can be a nice addition to a post, but please make sure the post is still clear and useful without them. If you post images of code or error messages make sure you also copy and paste or type the actual code/message into the post directly. – Rob Nov 15 '18 at 14:09
  • Instead of cloning, download it and follow other required steps. It worked for me this way. I downloaded it from this link. https://github.com/apache/parquet-mr/archive/apache-parquet-1.8.2.tar.gz Cheers! – Keith May 16 '19 at 14:09

5 Answers

16

On Ubuntu 20, I install via pip:

python3 -m pip install parquet-tools

Haven't tried on FreeBSD but I'd imagine it would also work. See related answer for a caveat on using pip on FreeBSD.

And you can view a file with:

parquet-tools show filename.parquet
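If you would rather not touch the system-wide Python packages, a virtual environment is one way to keep this install isolated. A minimal sketch, assuming `python3` ships with the `venv` module; the directory `~/parquet-venv` and the function name are made up for illustration, and the pip upgrade step follows the suggestion in the comments on this answer:

```shell
# Hypothetical location for the virtual environment; any writable path works.
VENV_DIR="${VENV_DIR:-$HOME/parquet-venv}"

install_parquet_tools() (
    set -e
    # Create an isolated environment so system packages are left alone.
    python3 -m venv "$VENV_DIR"
    # Upgrade pip first; an outdated pip is reported in the comments to break the install.
    "$VENV_DIR/bin/python" -m pip install --upgrade pip
    "$VENV_DIR/bin/python" -m pip install parquet-tools
)
```

Calling `install_parquet_tools` runs the three steps in a subshell; afterwards the command should be available as `$VENV_DIR/bin/parquet-tools`.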
Nagev
  • This answer doesn't help on CentOS Stream 8, as I receive the error "Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-ze7r45bk/pyarrow/"! – Am_I_Helpful Nov 23 '21 at 15:55
  • 1
    You may need to upgrade the `pip` version e.g. `python3 -m pip install --upgrade pip` before installing `parquet-tools`. If that doesn't fix it, you could ask a new question. – Nagev Nov 23 '21 at 17:01
  • Yes, upgrading pip version helped. Thank you (upvoted)! – Am_I_Helpful Nov 23 '21 at 18:45
13

I know the question specifies FreeBSD, but if you're on a Mac, you can run

brew install parquet-tools

jeffhu
6

parquet-tools is just one module of parquet-mr. It depends on some of the other modules.

When you build from a source version that corresponds to a release, those other modules will be available to Maven, because release artifacts are published as a part of the release process.
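As a rough sketch of building from a release rather than from master: the tag name below is taken from the release tarball linked in the question's comments (`apache-parquet-1.8.2`) and is only an example, and the function name is invented for illustration.

```shell
# Example release tag; substitute the release you want to build.
PARQUET_TAG="${PARQUET_TAG:-apache-parquet-1.8.2}"

build_release_parquet_tools() (
    set -e
    git clone https://github.com/apache/parquet-mr
    cd parquet-mr
    # Switch from the snapshot version on master to a released version.
    git checkout "$PARQUET_TAG"
    cd parquet-tools
    mvn clean package -Plocal
)
```

Running `build_release_parquet_tools` performs the whole sequence in a subshell, so your current directory is unchanged afterwards. `git tag -l 'apache-parquet-*'` inside the clone lists the available release tags.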

However, when building from a snapshot version, you have to make those dependencies available yourself. There are two ways to do so:

Option 1: Build and install all modules of the parent directory:

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn install -Plocal

This will put the snapshot artifacts in your local ~/.m2 directory. Subsequently, you can (re)build just parquet-tools like you initially tried, because now the snapshot artifacts will already be available from ~/.m2.

Option 2: Build only the parquet-tools module from the parent directory, asking Maven to also build the modules it depends on along the way (`-pl` selects the module, `-am` means "also make" its dependencies):

git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn package -pl parquet-tools -am -Plocal

Option 1 will build more projects than option 2, so if you only need parquet-tools, you are better off with the latter. Note, though, that both options will probably require a Thrift compiler to be installed.
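Before starting either build, it can save time to check that the required tools are on the PATH. A small sketch; `require_tool` and `check_build_tools` are helpers invented here and are not part of parquet-mr:

```shell
# Succeeds if the named command is on the PATH, otherwise reports it on stderr.
require_tool() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing required tool: $1" >&2
        return 1
    }
}

# Check every tool the build relies on; returns non-zero if any is missing.
check_build_tools() {
    status=0
    for tool in git mvn thrift; do
        require_tool "$tool" || status=1
    done
    return "$status"
}
```

For instance, `check_build_tools && thrift -version` prints the installed Thrift version only when all three tools were found.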

Zoltan
  • I've been trying to compile parquet-tools only but keep getting the following error: `[ERROR] thrift failed output: [WARNING:/home/user/parquet-mr/parquet-format-structures/target/parquet-format-thrift/parquet.thrift:295] The "byte" type is a compatibility alias for "i8". Use "i8" to emphasize the signedness of this type.` I'm not able to find any solution or workaround for this situation. I've already installed Thrift. `Thrift version 1.0.0` Any ideas? – Zombraz Jan 09 '19 at 20:09
  • That's strange, it's just a warning, it shouldn't fail the build. Nevertheless, try Thrift 0.9.3, that is the one that parquet-mr needs and it won't have this issue. See also https://issues.apache.org/jira/browse/PARQUET-1425 – Zoltan Jan 09 '19 at 21:45
  • 1
    Now I get a different error message. It says the following: `/home/edwinalejandro/parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnIndexFilterUtils.java:[30,34] package it.unimi.dsi.fastutil.ints does not exist` I've already opened the Java file, and it does include the following imports: `import it.unimi.dsi.fastutil.ints.IntArrayList; import it.unimi.dsi.fastutil.ints.IntList;` – Zombraz Jan 09 '19 at 23:55
  • Unless you are going to run with the "hadoop ..." command, you should build with -Dhadoop.scope=compile – Wheezil Oct 31 '20 at 12:04
2

Parquet tools: a utility for reading Parquet files. You can clone it from GitHub and build it with Maven.

1. git clone https://github.com/Parquet/parquet-mr.git 
2. cd parquet-mr/parquet-tools/ 
3. mvn clean package -Plocal 


Or you can download a stable release and build it locally.

1. Download a stable Parquet release:

    https://github.com/apache/parquet-mr/archive/apache-parquet-1.8.2.tar.gz


2. Build it locally with Maven:

 D:\parquet>cd parquet-tools && mvn clean package -Plocal



3. Test it (paste a parquet file under target directory):

 D:\parquet\parquet-tools\target>java -jar parquet-tools-1.8.2.jar schema out.parquet

(where out.parquet is my parquet file under target directory)


// Print the contents of the Parquet file

D:\parquet\parquet-tools\target>java -jar parquet-tools-1.8.2.jar cat out.parquet

// Print only the first few records of the Parquet file

D:\parquet\parquet-tools\target>java -jar parquet-tools-1.8.2.jar head -n5 out.parquet
Shashank
1

Some answers have a broken link for the jar download, but you can get it from Maven Central.

However... this jar and others like it are built so that the hadoop dependencies are "provided" and if you build from source, you'll get that default. So you need to set -Dhadoop.scope=compile when you build, or the result will only work when run on a hadoop node using the "hadoop ..." command.
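For example, such a build might look like the sketch below. It assumes you are in the root of an already cloned parquet-mr checkout; the function name is made up for illustration, and only the `-Dhadoop.scope=compile` property comes from this answer.

```shell
build_standalone_parquet_tools() (
    set -e
    cd parquet-tools
    # Override the default "provided" scope for the Hadoop dependencies,
    # so the result is not tied to the "hadoop ..." command.
    mvn clean package -Dhadoop.scope=compile
)
```

Running `build_standalone_parquet_tools` does the build in a subshell and leaves your working directory where it was.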

To make matters worse, this tool apparently disables System.out and System.err, so exceptions that cause main() to fail are never printed and you are left wondering what happened.

I also found that the default settings for the maven-license-plugin caused the build to fail when it encountered files it didn't expect (e.g. nbactions.xml if you use NetBeans).

Wheezil