Apache Arrow makes it easy to read Parquet metadata from many languages, including C, C++, Rust, Go, Java, and JavaScript.
Here's how to get the schema with PyArrow (the Python Apache Arrow API):
import pyarrow.parquet as pq
table = pq.read_table(path)  # path points to a Parquet file or directory
table.schema # pa.schema([pa.field("movie", "string", False), pa.field("release_year", "int64", True)])
See here for more details about how to read metadata information from Parquet files with PyArrow.
You can also grab the schema of a Parquet file with Spark.
val df = spark.read.parquet("some_dir/")
df.schema // returns a StructType
StructType objects look like this:
StructType(
StructField(number,IntegerType,true),
StructField(word,StringType,true)
)
From the StructType object, you can recover the column name, data type, and nullable property stored in the Parquet metadata. The Spark approach isn't as clean as the Arrow approach.