Questions tagged [parquet-mr]
47 questions
13
votes
5 answers
Installing parquet-tools
I am trying to install parquet tools on a FreeBSD machine.
I cloned this repo: git clone https://github.com/apache/parquet-mr
Then I did cd parquet-mr/parquet-tools
Then I did `mvn clean package -Plocal`
As specified here:…

user3685285
- 6,066
- 13
- 54
- 95
6
votes
0 answers
Sorted parquet files for query optimization
Question Purpose
Sorting parquet files provides a number of benefits:
more efficient filtering using file metadata
more efficient compression rate
There may be other benefits as well. There is a lot of discussion about this on the Internet. For…

Amin
- 1,643
- 16
- 25
5
votes
1 answer
Converting Arrow to Parquet and vice versa in Java
I have been looking at ways to convert arrow to parquet and vice versa in Java.
Even though the Python library for arrow has full support for the mentioned conversion, I can hardly find any documentation for the same in Java.
Has anyone come across…

Optimus
- 697
- 2
- 8
- 22
4
votes
0 answers
Parquet storage size higher for duplicate data
I have a dataset of close to 2 billion rows in parquet format, spanning 200 files. It occupies 17.4GB on S3. Close to 45% of its rows are duplicates. I deduplicated the dataset using the 'distinct' function in Spark, and wrote it to…

Phanindra Kothoori
- 61
- 6
3
votes
2 answers
Reading a parquet file using Java works on local machine, but doesn't work in Docker container
I have a requirement to read parquet files and publish to Kafka in a Java standalone application. I have the code below to read the parquet file, which is generated by a Spark Scala application.
public void readTest(Path path) {
try {
…

Sugyan sahu
- 129
- 1
- 8
3
votes
1 answer
INT32 type error when scanning parquet federated table. Bug or Expected behavior?
I am using BigQuery to query an external data source (also known as a federated table), where the source data is a hive-partitioned parquet table stored in Google Cloud Storage. I used this guide to define the table.
My first query to test this…

conradlee
- 12,985
- 17
- 57
- 93
3
votes
0 answers
Is it possible to write multiple Oracle database tables into one parquet file?
I have a requirement where I want to convert my Oracle DB data to parquet. In my database I have multiple tables, for example Employee and Department.
Is it possible to insert the data of both tables into a single parquet file? Or do I need to…

Ankur Gupta
- 31
- 1
3
votes
1 answer
Why is dictionary page offset 0 for `plain_dictionary` encoding?
The parquet file was generated by Spark v2.4, parquet-mr v1.10
n = 10000
x = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, None] * n
y = [u'é', u'é', u'é', u'é', u'a', None, u'a'] * n
z = np.random.rand(len(x)).tolist()
dfs = spark.createDataFrame(zip(x, y, z),…

colinfang
- 20,909
- 19
- 90
- 173
2
votes
0 answers
Does Apache Parquet support Custom Filter Predicate on Repeated values?
Does Apache Parquet support Custom Filter Predicate on Repeated values? By applying a filter on a repeated value, I get:
FilterPredicates do not currently support repeated columns. Column
part.x is repeated
The filter I set on the x double…

Nicholas Kou
- 173
- 2
- 13
2
votes
0 answers
parquet-tools cannot read zstd files but can read gzip?
I installed the latest version of parquet-tools from apache-mr with version parquet-tools-1.8.2.jar.
Here is a reproducible example:
>>> import boto3
>>> client = GET_CLIENT() # redacted
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3]],…

OneRaynyDay
- 3,658
- 2
- 23
- 56
2
votes
0 answers
Add parquet-tools to path (Visual Studio Code)
I am trying to use this parquet-viewer so I can easily view parquet files in Visual Studio Code.
It requires that parquet-tools are available in the path.
I did
brew install parquet-tools
and when I try to open my .parquet file with Visual Studio…

Mike
- 444
- 1
- 8
- 19
2
votes
0 answers
Read a fastparquet file using Akka parquet
I have one of our Python systems generating Parquet files using Pandas and fastparquet. These are to be read by a Scala system that runs atop Akka streams.
Akka does provide a source for reading Avro Parquet files. However, when I try to read the…

An SO User
- 24,612
- 35
- 133
- 221
2
votes
1 answer
PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)
I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
And…

Nevermore
- 7,141
- 5
- 42
- 64
1
vote
0 answers
AvroParquetWriter - addLogicalTypeConversion not working as expected (using version parquet-avro 1.12.3) - causing ClassCastException
I am writing a ResultSet to a parquet file using AvroParquetWriter. One column in the ResultSet is java.sql.Timestamp. When writing, I get the exception:
java.sql.Timestamp cannot be cast to java.lang.Number
Adding addLogicalTypeConversion does not…

javaseeker
- 73
- 1
- 9
1
vote
0 answers
How should protobuf message with repeated fields be converted to parquet to be queried by Athena?
We write parquet files to S3 and then use Athena to query that data. We use the "parquet-protobuf" library to convert proto messages into parquet records. We recently added a repeated field to our proto message definition and we were expecting to…

user2903819
- 180
- 2
- 12