Questions tagged [parquet]

Apache Parquet is a columnar storage format for Hadoop.

Parquet was created to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

3891 questions
204 votes · 5 answers

What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File, etc., I want an overview of the formats. I have already read: How Impala Works with Hadoop File Formats,…
asked by Ani Menon
196 votes · 2 answers

What are the differences between feather and parquet?

Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do both formats…
asked by Darkonaut
164 votes · 8 answers

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple…
asked by Daniel Mahler
146 votes · 1 answer

Difference between Apache Parquet and Arrow

I'm looking for a way to speed up my memory-intensive frontend visualization app. I saw some people recommend Apache Arrow; while I'm looking into it, I'm confused about the difference between Parquet and Arrow. They are both columnar data structures.…
asked by Audrey
130 votes · 6 answers

Avro vs. Parquet

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all of the columns' data! Before I proceed and choose one of the file…
asked by Abhishek
122 votes · 13 answers

Inspect Parquet from command line

How do I inspect the content of a Parquet file from the command line? The only option I see now is:

$ hadoop fs -get my-path local-file
$ parquet-tools head local-file | less

I would like to avoid creating the local-file and view the file content…
asked by sds
95 votes · 6 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet to be better in…
asked by Rahul
75 votes · 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
asked by moshevi
67 votes · 11 answers

How to view Apache Parquet file in Windows?

I couldn't find any plain English explanations regarding Apache Parquet files. Such as: What are they? Do I need Hadoop or HDFS to view/create/store them? How can I create parquet files? How can I view parquet files? Any help regarding these…
asked by Sal
63 votes · 5 answers

How to read partitioned parquet files from S3 using pyarrow in python

I am looking for ways to read data from multiple partitioned directories on S3 using…
asked by stormfield
61 votes · 10 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
asked by metasim
58 votes · 18 answers

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")  # Success
print…
asked by user48956
58 votes · 8 answers

Python: save pandas data frame to parquet file

Is it possible to save a pandas data frame directly to a parquet file? If not, what would be the suggested process? The aim is to be able to send the parquet file to another team, who can then use Scala code to read/open it. Thanks!
asked by Edamame
57 votes · 7 answers

Pandas: Reading first n rows from parquet file?

I have a parquet file and I want to read the first n rows from the file into a pandas data frame. What I tried: df = pd.read_parquet(path='filepath', nrows=10). It did not work and gave me the error: TypeError: read_table() got an unexpected keyword…
asked by Sanchit Kumar
55 votes · 10 answers

How to convert a csv file to parquet

I'm new to Big Data. I need to convert a .csv/.txt file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
asked by author243