
I want to provide the people I work with a tool to create parquet files to be used in unit tests of modules that read and process such files.

I use ParquetViewer to view the content of parquet files, but I would like a tool to make (sample) parquet files. Is there such a tool to create parquet files, with a GUI or otherwise some practical CLI?

Note: I would prefer a cross-platform solution, but failing that I am looking for a Windows/MinGW solution so I can use it at work - where I cannot choose the OS :\

Juh_
  • What is the data source you want to create the sample file from? – ilmiacs Aug 19 '19 at 16:11
  • The team I speak of develops code that processes the parquet files. The data is provided by other teams, and the first team will ask the second for sample data. But that can take time. On the target deployment, the input files will be stored on HDFS, but this should not be a constraint. – Juh_ Aug 19 '19 at 16:20
  • Sure, but when creating the sample files, they are supposed to contain some (sample) data. So is this data and its schema supposed to be entered manually in the tool? Or does the schema come from some schema definition file format, and the sample data from Excel? – ilmiacs Aug 19 '19 at 16:57
  • The idea is to enter it by hand, so we can run the code on realistic data before we obtain the actual samples. There are ways around it, but I am just wondering. And I am somewhat surprised that such a tool is not easy to find. – Juh_ Aug 20 '19 at 07:44
  • If you do it by hand you are prone to bias in the form of potentially misleading assumptions about the final format, both concerning the schema (data types) and the data, so I would advise against it. I would recommend asking the second team to provide a DDL query and some sample data, say in Excel, and trying e.g. https://github.com/msafiullah/excel_to_parquet to generate the parquet file via CLI. – ilmiacs Aug 20 '19 at 08:30

2 Answers


parquet-cli, written in Java, can convert CSV files to parquet.

(This is a sample on Windows)

test.csv is below:

emp_id,dept_id,name,created_at,updated_at
1,1,"test1","2019-02-17 10:00:00","2019-02-17 12:00:00"
2,2,"test2","2019-02-17 10:00:00","2019-02-17 12:00:00"

It requires winutils on Windows. Download it and set the environment variable:

$ set HADOOP_HOME=D:\development\hadoop

Clone parquet-mr, build everything, and run the 'convert-csv' command of parquet-cli:

$ cd parquet-cli
$ java -cp target/classes;target/dependency/* org.apache.parquet.cli.Main convert-csv C:\Users\foo\Downloads\test.csv -o C:\Users\foo\Downloads\test-csv.parquet

The 'cat' command shows the content of that parquet file:

$ java -cp target/classes;target/dependency/* org.apache.parquet.cli.Main cat C:\Users\foo\Downloads\test-csv.parquet
{"emp_id": 1, "dept_id": 1, "name": "test1", "created_at": "2019-02-17 10:00:00", "updated_at": "2019-02-17 12:00:00"}
{"emp_id": 2, "dept_id": 2, "name": "test2", "created_at": "2019-02-17 10:00:00", "updated_at": "2019-02-17 12:00:00"}

Copying from this answer: https://stackoverflow.com/a/74010417/220997

You can use DBeaver to create parquet files; it's cross-platform. Create an in-memory DuckDB database and then write a query. Some examples are here: https://duckdb.org/docs/data/parquet

It still requires some technical knowledge but it's not too bad.

Example code to output one record with one column:

COPY (SELECT 'test1' as col1) TO 'C:\Users\name\Desktop\result-snappy.parquet' (FORMAT 'parquet');

You can use the same process to view the file:

SELECT * FROM read_parquet('C:\Users\name\Desktop\result-snappy.parquet');
Gabe