Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
204 votes · 5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File, etc., I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats,…
asked by Ani Menon
196 votes · 2 answers

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats…
asked by Darkonaut
164 votes · 8 answers

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple…
asked by Daniel Mahler
146 votes · 1 answer

Difference between Apache Parquet and Arrow

I'm looking for a way to speed up my memory-intensive frontend visualization app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow. They are both columnar data structures.…
asked by Audrey
130 votes · 6 answers

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all of the columns' data! Before I proceed and choose one of the file…
asked by Abhishek
122 votes · 13 answers

Inspect Parquet from command line

How do I inspect the content of a Parquet file from the command line? The only option I see now is:

$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less

I would like to avoid creating the local-file and view the file content…
asked by sds
95 votes · 6 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet to be better in…
asked by Rahul
75 votes · 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
asked by moshevi
67 votes · 11 answers

How to view Apache Parquet file in Windows?

I couldn't find any plain English explanations regarding Apache Parquet files. Such as: What are they? Do I need Hadoop or HDFS to view/create/store them? How can I create parquet files? How can I view parquet files? Any help regarding these…
asked by Sal
63 votes · 5 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using…
asked by stormfield
61 votes · 10 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
asked by metasim
58 votes · 18 answers

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")  # Success
print…
asked by user48956
58 votes · 8 answers

Python: save pandas data frame to parquet file

Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process? The aim is to be able to send the parquet file to another team, who can then use Scala code to read/open it. Thanks!
asked by Edamame
57 votes · 7 answers

Pandas: Reading first n rows from parquet file?

I have a parquet file and I want to read the first n rows from the file into a pandas data frame. What I tried: df = pd.read_parquet(path='filepath', nrows=10). It did not work and gave me the error: TypeError: read_table() got an unexpected keyword…
asked by Sanchit Kumar
55 votes · 10 answers

How to convert a csv file to parquet

I'm new to Big Data. I need to convert a .csv/.txt file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
asked by author243