Inspect Parquet from command line

Question

How do I inspect the content of a Parquet file from the command line?

The only option I see now is

$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less

I would like to

avoid creating the local-file and
view the file content as json rather than the typeless text that parquet-tools prints.

Is there an easy way?

score 127 · Accepted Answer · edited Jan 23 '19 at 00:01

127

You can use parquet-tools with the command cat and the --json option in order to view the files without a local copy and in the JSON format.

Here is an example:

parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

This prints out the data in JSON format:

{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}

Disclaimer: this was tested in Cloudera CDH 5.12.0

edited Jan 23 '19 at 00:01

James Wierzba

16,176
14
79
120

answered Nov 14 '17 at 14:37

gil.fernandes

12,978
5
63
76

2

Thanks! That was really helpful :) In our case the ```cat --json``` did not work but in the end we used ```parquet-tools csv input.gz.parquet | csvq -f json "select id, description" ``` – GC268DM Jul 08 '21 at 05:49
1

This works for me but...would you please tell me how to only show 3 lines? – puifais Sep 14 '21 at 19:34
3

Is this parquet-cli now? `brew install parquet-cli` then the command is `parquet`? https://github.com/apache/parquet-mr/tree/master/parquet-cli – Josh Hibschman Nov 02 '21 at 18:58
I installed it with `brew install parquet-tools`. You can choose to show only a few lines using the 'head' mode. `parquet-tools head --json [file]' will print the first 5 records. I can't get the --records flag to work, which should let you specify that you want to show 3 records. – Matt Farrow Feb 21 '22 at 17:07

score 88 · Answer 2 · answered Mar 02 '19 at 18:59

Install homebrew on your Mac (see https://brew.sh/) and then just:

brew install parquet-tools

Once you have done that you can user parquet-tools binary (which should now be in your path) at your command line for various commands.

parquet-tools or parquet-tools -h will give you usage info.

Examples:

> parquet-tools rowcount part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet 
Total RowCount: 148192

> parquet-tools head -n 1 part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet 
:created_at = 2019-02-28T00:16:06.329Z
:id = row-wive~i58u-qaeu
:updated_at = 2019-02-28T00:16:06.329Z
agency = 1
body_style = PA
color = GY
fine_amount = 63
issue_date = 17932
issue_time = 1950
latitude = 64379050
location = 12743 DAVENTRY
longitude = 19261609
make = HYDA
marked_time = 
meter_id = 
plate_expiry_date = 18048
route = 16X2
rp_state_plate = CA
ticket_number = 1020798376
vin = 
violation_code = 22502A#
violation_description = 18 IN. CURB/2 WAY

> parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet 
file:                  file:/Users/matthewropp/team_demo/los-angeles-parking-citations/raw_citations/issue_month=201902/part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet 
creator:               parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:                 org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":":created_at","type":"string","nullable":true,"metadata":{}},{"name":":id","type":"string","nullable":true,"metadata":{}},{"name":":updated_at","type":"string","nullable":true,"metadata":{}},{"name":"agency","type":"integer","nullable":true,"metadata":{}},{"name":"body_style","type":"string","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"fine_amount","type":"integer","nullable":true,"metadata":{}},{"name":"issue_date","type":"date","nullable":true,"metadata":{}},{"name":"issue_time","type":"integer","nullable":true,"metadata":{}},{"name":"latitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"location","type":"string","nullable":true,"metadata":{}},{"name":"longitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"make","type":"string","nullable":true,"metadata":{}},{"name":"marked_time","type":"string","nullable":true,"metadata":{}},{"name":"meter_id","type":"string","nullable":true,"metadata":{}},{"name":"plate_expiry_date","type":"date","nullable":true,"metadata":{}},{"name":"route","type":"string","nullable":true,"metadata":{}},{"name":"rp_state_plate","type":"string","nullable":true,"metadata":{}},{"name":"ticket_number","type":"string","nullable":false,"metadata":{}},{"name":"vin","type":"string","nullable":true,"metadata":{}},{"name":"violation_code","type":"string","nullable":true,"metadata":{}},{"name":"violation_description","type":"string","nullable":true,"metadata":{}}]} 

file schema:           spark_schema 
--------------------------------------------------------------------------------
:                      created_at: OPTIONAL BINARY O:UTF8 R:0 D:1
:                      id: OPTIONAL BINARY O:UTF8 R:0 D:1
:                      updated_at: OPTIONAL BINARY O:UTF8 R:0 D:1
agency:                OPTIONAL INT32 R:0 D:1
body_style:            OPTIONAL BINARY O:UTF8 R:0 D:1
color:                 OPTIONAL BINARY O:UTF8 R:0 D:1
fine_amount:           OPTIONAL INT32 R:0 D:1
issue_date:            OPTIONAL INT32 O:DATE R:0 D:1
issue_time:            OPTIONAL INT32 R:0 D:1
latitude:              OPTIONAL INT32 O:DECIMAL R:0 D:1
location:              OPTIONAL BINARY O:UTF8 R:0 D:1
longitude:             OPTIONAL INT32 O:DECIMAL R:0 D:1
make:                  OPTIONAL BINARY O:UTF8 R:0 D:1
marked_time:           OPTIONAL BINARY O:UTF8 R:0 D:1
meter_id:              OPTIONAL BINARY O:UTF8 R:0 D:1
plate_expiry_date:     OPTIONAL INT32 O:DATE R:0 D:1
route:                 OPTIONAL BINARY O:UTF8 R:0 D:1
rp_state_plate:        OPTIONAL BINARY O:UTF8 R:0 D:1
ticket_number:         REQUIRED BINARY O:UTF8 R:0 D:0
vin:                   OPTIONAL BINARY O:UTF8 R:0 D:1
violation_code:        OPTIONAL BINARY O:UTF8 R:0 D:1
violation_description: OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1:           RC:148192 TS:10503944 OFFSET:4 
--------------------------------------------------------------------------------
:                      created_at:  BINARY SNAPPY DO:0 FPO:4 SZ:607/616/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
:                      id:  BINARY SNAPPY DO:0 FPO:611 SZ:2365472/3260525/1.38 VC:148192 ENC:BIT_PACKED,PLAIN,RLE ST:[min: row-2229_y75z.ftdu, max: row-zzzs_4hta.8fub, num_nulls: 0]
:                      updated_at:  BINARY SNAPPY DO:0 FPO:2366083 SZ:602/611/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
agency:                 INT32 SNAPPY DO:0 FPO:2366685 SZ:4871/5267/1.08 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 58, num_nulls: 0]
body_style:             BINARY SNAPPY DO:0 FPO:2371556 SZ:36244/61827/1.71 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WR, num_nulls: 0]
color:                  BINARY SNAPPY DO:0 FPO:2407800 SZ:111267/111708/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YL, num_nulls: 0]
fine_amount:            INT32 SNAPPY DO:0 FPO:2519067 SZ:71989/82138/1.14 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 25, max: 363, num_nulls: 63]
issue_date:             INT32 SNAPPY DO:0 FPO:2591056 SZ:20872/23185/1.11 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-01, max: 2019-02-27, num_nulls: 0]
issue_time:             INT32 SNAPPY DO:0 FPO:2611928 SZ:210026/210013/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 2359, num_nulls: 41]
latitude:               INT32 SNAPPY DO:0 FPO:2821954 SZ:508049/512228/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 6513161.2, num_nulls: 0]
location:               BINARY SNAPPY DO:0 FPO:3330003 SZ:1251364/2693435/2.15 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE ST:[min: , max: ZOMBAR/VALERIO, num_nulls: 0]
longitude:              INT32 SNAPPY DO:0 FPO:4581367 SZ:516233/520692/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 1941557.4, num_nulls: 0]
make:                   BINARY SNAPPY DO:0 FPO:5097600 SZ:147034/150364/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YAMA, num_nulls: 0]
marked_time:            BINARY SNAPPY DO:0 FPO:5244634 SZ:11675/17658/1.51 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: 959.0, num_nulls: 0]
meter_id:               BINARY SNAPPY DO:0 FPO:5256309 SZ:172432/256692/1.49 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YO97, num_nulls: 0]
plate_expiry_date:      INT32 SNAPPY DO:0 FPO:5428741 SZ:149849/152288/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2000-02-01, max: 2099-12-01, num_nulls: 18624]
route:                  BINARY SNAPPY DO:0 FPO:5578590 SZ:38377/45948/1.20 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WTD, num_nulls: 0]
rp_state_plate:         BINARY SNAPPY DO:0 FPO:5616967 SZ:33281/60186/1.81 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: AB, max: XX, num_nulls: 0]
ticket_number:          BINARY SNAPPY DO:0 FPO:5650248 SZ:801039/2074791/2.59 VC:148192 ENC:BIT_PACKED,PLAIN ST:[min: 1020798376, max: 4350802142, num_nulls: 0]
vin:                    BINARY SNAPPY DO:0 FPO:6451287 SZ:64/60/0.94 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
violation_code:         BINARY SNAPPY DO:0 FPO:6451351 SZ:94784/131071/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 000, max: 8942, num_nulls: 0]
violation_description:  BINARY SNAPPY DO:0 FPO:6546135 SZ:95937/132641/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YELLOW ZONE, num_nulls: 0]

> parquet-tools dump -m -c make part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet | head -20
BINARY make 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 148192 *** 
value 1:      R:0 D:1 V:HYDA
value 2:      R:0 D:1 V:NISS
value 3:      R:0 D:1 V:NISS
value 4:      R:0 D:1 V:TOYO
value 5:      R:0 D:1 V:AUDI
value 6:      R:0 D:1 V:MERC
value 7:      R:0 D:1 V:LEX
value 8:      R:0 D:1 V:BMW
value 9:      R:0 D:1 V:GMC
value 10:     R:0 D:1 V:HOND
value 11:     R:0 D:1 V:TOYO
value 12:     R:0 D:1 V:NISS
value 13:     R:0 D:1 V:
value 14:     R:0 D:1 V:THOR
value 15:     R:0 D:1 V:DODG
value 16:     R:0 D:1 V:DODG
value 17:     R:0 D:1 V:HOND

Didn't work for me under Windows 10 x64 with WSL. See my answer for solving that by building from source. — Contango, Mar 14 '20 at 16:59
I agree, your answer is entirely correct. I thought it might also work on WSL (Windows Subsystem for Linux), as this should be entirely compatible with Linux, but this does not appear to be the case. WSL is reasonably good; most other packages can be installed with brew. And WSL 2 may be better. — Contango, Mar 16 '20 at 13:34
When I try `pip install parquet-tools`, the utility does not have `meta` or `rowcount` commands, only `show`, `csv`, and `inspect`. Is this a different utility or has it changed? — Dahn, Aug 06 '21 at 21:55
@Dahn the version available via `pip` is an older version. I recommend installing via `homebrew`. If you want something that has somewhat similar functionality without homebrew, you could try `pip install parquet-cli` — mropp, Aug 10 '21 at 18:52
Unfortunately the homebrew formula has been deprecated since 31-07-2022 — Oscar Drai, Sep 19 '22 at 13:49
As a work-around for the deprecated formula, I ran `brew edit parquet-tools`, commented out the line starting with `disable!`, saved changes, and ran `HOMEBREW_NO_INSTALL_FROM_API=1 brew install parquet-tools`. However, I ran into an issue due to using an M1 Mac, which I wasn't able to solve: `org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Mac and os.arch=aarch64`. I used `parquet-cli` instead. — Illya Moskvin, Jul 06 '23 at 19:49

score 51 · Answer 3 · answered Sep 02 '19 at 09:20

51

I've found this program really useful: https://github.com/chhantyal/parquet-cli

Lets you view parquet files without having the whole infrastructure installed.

Just type:

pip install parquet-cli
parq input.parquet --head 10

answered Sep 02 '19 at 09:20

Camembert

657
7
11

score 27 · Answer 4 · edited Jan 29 '19 at 00:42

27

I recommend just building and running the parquet-tools.jar for your Hadoop distribution.

Checkout the github project: https://github.com/apache/parquet-mr/tree/master/parquet-tools

hadoop jar ./parquet-tools-<VERSION>.jar <command>.

edited Jan 29 '19 at 00:42

Community

1
1

answered Mar 21 '16 at 21:00

fembot

296
3
3

30

On mac you can install with homebrew: brew install parquet-tools – ronhash Jan 31 '18 at 18:06
1

I've uploaded the JAR here (no need to mess around with buggy build instructions on the projects github page): http://www102.zippyshare.com/v/EGBKrDuq/file.html – Aydin K. Feb 06 '18 at 12:41
The github link does not work and zippyshare.com gives so many download links that its confusing. The correct github link is: https://github.com/apache/parquet-mr/tree/master/parquet-tools – smishra Sep 06 '18 at 16:25
3

If you don't want to build you can download the jar from maven site: http://central.maven.org/maven2/org/apache/parquet/parquet-tools/1.10.0/parquet-tools-1.10.0.jar – smishra Sep 06 '18 at 16:34
3

The maven build doesn't include dependencies, so I get an error `java.lang.NoClassDefFoundError: org/apache/hadoop/fs/Path` – Dobes Vandermeer Dec 06 '19 at 22:05
1

As of mid-2020, the link for the parquet-tools JAR on Maven Central is https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.0/parquet-tools-1.11.0.jar – Denis Arnaud Aug 03 '20 at 08:51
https://github.com/apache/parquet-mr/tree/master/parquet-cli – Josh Hibschman Nov 02 '21 at 18:55

score 13 · Answer 5 · answered Jan 31 '17 at 19:02

By default parquet-tools in general will look for the local file directory, so to point it to hdfs, we need to add hdfs:// in the beginning of the file path. So in your case, you can do something like this

parquet-tools head hdfs://localhost/<hdfs-path> | less

I had the same issue and it worked fine for me. There is no need to download the file locally first.

Brandon · Answer 6 · 2021-07-16T03:47:05.427

Actually, I find out that pandas has already supported parquet files, as long as you've installed pyarrow or fastparquet as its backend. Check out read_parquet:

import pandas as pd

df = pd.read_parquet('your-file.parquet')

df.head(10)
...

Previous answer: Might be late for the party, but I just learnt that pyarrow supports reading parquet already, and it's quite powerful. Chances are that you already have pyarrow and pandas installed, so you can read parquet just like this

from pyarrow import parquet
import pandas

p = parquet.read_table('/path/to/your/xxxxx.parquet')
df = p.to_pandas()

df.head(10)
...

score 9 · Answer 7 · answered Oct 31 '22 at 09:05

DuckDB has CLI tool (prebuilt binaries for linux, windows, macOS) that can be used to query parquet data from command line.

PS C:\Users\nsuser\dev\standalone_executable_binaries> ./duckdb
v0.5.1 7c111322d
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

Read parquet data using SQL queries

D SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet') limit 3;
┌─────────────────────┬────┬────────────┬───────────┬─────────────────────────┬────────┬────────────────┬──────────────────┬───────────┬───────────┬───────────┬─────────────────────┬──────────┐
│  registration_dttm  │ id │ first_name │ last_name │          email          │ gender │   ip_address   │        cc        │  country  │ birthdate │  salary   │        title        │ comments │
├─────────────────────┼────┼────────────┼───────────┼─────────────────────────┼────────┼────────────────┼──────────────────┼───────────┼───────────┼───────────┼─────────────────────┼──────────┤
│ 2016-02-03 07:55:29 │ 1  │ Amanda     │ Jordan    │ ajordan0@com.com        │ Female │ 1.197.201.2    │ 6759521864920116 │ Indonesia │ 3/8/1971  │ 49756.53  │ Internal Auditor    │ 1E+02    │
│ 2016-02-03 17:04:03 │ 2  │ Albert     │ Freeman   │ afreeman1@is.gd         │ Male   │ 218.111.175.34 │                  │ Canada    │ 1/16/1968 │ 150280.17 │ Accountant IV       │          │
│ 2016-02-03 01:09:31 │ 3  │ Evelyn     │ Morgan    │ emorgan2@altervista.org │ Female │ 7.161.136.94   │ 6767119071901597 │ Russia    │ 2/1/1960  │ 144972.51 │ Structural Engineer │          │
└─────────────────────┴────┴────────────┴───────────┴─────────────────────────┴────────┴────────────────┴──────────────────┴───────────┴───────────┴───────────┴─────────────────────┴──────────┘

Read Parquet Schema.

D DESCRIBE SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
OR
D SELECT * FROM PARQUET_SCHEMA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌───────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│    column_name    │ column_type │ null │ key │ default │ extra │
├───────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ registration_dttm │ TIMESTAMP   │ YES  │     │         │       │
│ id                │ INTEGER     │ YES  │     │         │       │
│ first_name        │ VARCHAR     │ YES  │     │         │       │
│ birthdate         │ VARCHAR     │ YES  │     │         │       │
│ salary            │ DOUBLE      │ YES  │     │         │       │
└───────────────────┴─────────────┴──────┴─────┴─────────┴───────┘

Read Parquet metadata and statistics.

D SELECT row_group_id, row_group_num_rows, compression, stats_min, stats_max, stats_null_count FROM PARQUET_METADATA('C:\Users\nsuser\dev\sample_files\userdata1.parquet');
┌──────────────┬────────────────────┬──────────────┬─────────────────────┬─────────────────────┬──────────────────┐
│ row_group_id │ row_group_num_rows │ compression  │      stats_min      │      stats_max      │ stats_null_count │
├──────────────┼────────────────────┼──────────────┼─────────────────────┼─────────────────────┼──────────────────┤
│ 0            │ 1000               │ UNCOMPRESSED │ 2016-02-03 22:59:12 │ 2016-02-03 20:51:31 │ 0                │
│ 0            │ 1000               │ UNCOMPRESSED │ 1                   │ 1000                │ 0                │
│ 0            │ 1000               │ UNCOMPRESSED │ "Bonaire            │ Zimbabwe            │ 0                │
│ 0            │ 1000               │ UNCOMPRESSED │                     │ 9/9/1981            │ 0                │
│ 0            │ 1000               │ UNCOMPRESSED │ 12380.49            │ 286592.99           │ 68               │
└──────────────┴────────────────────┴──────────────┴─────────────────────┴─────────────────────┴──────────────────┘

Alternative:

parquet-cli is a light weight python alternative.

pip install parquet-cli          //installs via pip
parq filename.parquet            //view meta data
parq filename.parquet --schema   //view the schema
parq filename.parquet --head 10  //view top n rows

oh that's sweet - thanks for the screenshots and examples to convince me to try `duckdb` — WestCoastProjects, Nov 13 '22 at 06:19
Thanks for the pointer towards DuckDB. They were already on my radar but now I have a real reason to use it. — Saïd, Feb 05 '23 at 08:00

score 5 · Answer 8 · answered Aug 27 '19 at 21:47

5

If you're using HDFS, the following commands are very useful as they are frequently used (left here for future reference):

hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar head -n5 hdfs://path/to/file.snappy.parquet

answered Aug 27 '19 at 21:47

PALEN

2,784
2
23
24

By the way, parquet-tools-1.9.0.jar could be retrieved from maven repo: https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.9.0/parquet-tools-1.9.0.jar – Zhigong Li Oct 28 '21 at 06:40

score 2 · Answer 9 · answered Jan 26 '21 at 10:24

2

If you use Docker you can also do something like this:

docker run -ti -v C:\file.parquet:/tmp/file.parquet nathanhowell/parquet-tools cat /tmp/file.parquet

answered Jan 26 '21 at 10:24

Anxo

81
1
1

score 1 · Answer 10 · answered Mar 14 '20 at 17:37

On Windows 10 x64, try Parq:

choco install parq

This installs everything into the current directory. You will have to add this directory manually to the path, or run parq.exe from within this directory.

My other answer builds parquet-reader from source. This utility looks like it does much the same job.

score 0 · Answer 11 · answered Jul 17 '18 at 09:00

I'd rather use hdfs NFS Gateway + autofs for easy hdfs file investigation.

My setup:

HDFS NFS Gateway service running on namenode.
distribution bundled autofs service on. with following configuration change made to auto.master

/net    -hosts nobind

I can easily run following command to investigate any hdfs file

head /net/<namenodeIP>/path/to/hdfs/file
parquet-tools head /net/<namenodeIP>/path/to/hdfs/par-file
rsync -rv /local/directory/ /net/<namenodeIP>/path/to/hdfs/parentdir/

forget about the hadoop* hdfs* command ;)

Contango · Answer 12 · 2020-03-14T22:22:11.530

On Windows 10 x64 I ended up building parquet-reader just now from source:

Windows 10 + WSL + GCC

Installed WSL with Ubuntu LTS 18.04. Upgraded gcc to v9.2.1 and CMake to latest. Bonus: install Windows Terminal.

git checkout https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildgcc
cd buildgcc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
make -j 20
cd release
./parquet-reader
Usage: parquet-reader [--only-metadata] [--no-memory-map] [--json] [--dump] [--print-key-value-metadata] [--columns=...] <file>

If it has trouble building, may have to use vcpkg for the missing libraries.

Also see a another solution that offers less, but in a simpler way: https://github.com/chhantyal/parquet-cli

Linked from: How can I write streaming/row-oriented data using parquet-cpp without buffering?

Initially tried brew install parquet-tools, but this did not appear to work under my install of WSL

Windows 10 + MSVC

Same as above. Use CMake to generate the Visual Studio 2019 project, then build.

git checkout https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildmsvc
cd buildmsvc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
# Then open the generated .sln file in MSVC and build. Everything should build perfectly.

Troubleshooting:

In case there was any missing libraries, I pointed it at my install of vcpkg. I ran vcpkg integrate install, then copied the to the end of the CMake line:

-DCMAKE_TOOLCHAIN_FILE=[...path...]/vcpkg/scripts/buildsystems

If it had complained about any missing libraries, I would have installed these, e.g. boost, etc using commands like vcpkg install boost:x64.

score 0 · Answer 13 · answered Apr 11 '22 at 07:02

In case anyone else comes to this looking for an easy way to inspect a parquet file from the command line, I wrote the tool clidb to do this.

It doesn’t generate json like the OP wanted but instead shows the parquet data as a table and allows SQL snippets to be run against it. It should work with:

pip install "clidb[extras]"
clidb path/with/data

Inspect Parquet from command line

13 Answers13

Windows 10 + WSL + GCC

Windows 10 + MSVC

Linked