61

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language.

Is an R reader available? Or is work being done on one?

If not, what would be the most expedient way to get there? Note: There are Java and C++ bindings: https://github.com/apache/parquet-mr

– MichaelChirico, metasim

10 Answers

38

The simplest way to do this is with the arrow package, which is available on CRAN.

install.packages("arrow")
library(arrow)
read_parquet("somefile.parquet")

Previously this could be done through Python via pyarrow, but nowadays arrow is packaged for R directly, with no need for Python.

If you do not want to install from CRAN, you can build the Arrow C++ library directly and then install the R package from GitHub:

git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install

Then you can install the R arrow package:

devtools::install_github("apache/arrow/r")

And use it to load a Parquet file:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     array, table
read_parquet("somefile.parquet", as_tibble = TRUE)
#> # A tibble: 10 x 2
#>        x       y
#>    <int>   <dbl>
#> …
– Magnus, Uwe L. Korn
36

You can simply use the arrow package:

install.packages("arrow")
library(arrow)
read_parquet("myfile.parquet")
– fc9.30
  • @DavidArenburg True, though this answer reflects the change that `arrow` is now available on CRAN and hence can be directly installed. – B.Liu Oct 29 '19 at 10:57
30

If you're using Spark, this is now relatively simple with the release of Spark 1.4. See the sample code below, which uses the SparkR package that is now part of the Apache Spark core framework.

# install the SparkR package
devtools::install_github('apache/spark', ref='master', subdir='R/pkg')

# load the SparkR package
library('SparkR')

# initialize sparkContext which starts a new Spark session
sc <- sparkR.init(master="local")

# initialize sqlContext
sq <- sparkRSQL.init(sc)

# load parquet file into a Spark data frame and coerce into R data frame
df <- collect(parquetFile(sq, "/path/to/filename"))

# terminate Spark session
sparkR.stop()

An expanded example is shown @ https://gist.github.com/andyjudson/6aeff07bbe7e65edc665

I'm not aware of any other package that you could use if you weren't using Spark.

– Andy Judson
  • Any guess as to why this won't work with my parquet files on S3? Key and secret are declared, and python can read them fine with sqlCtx.read.parquet("s3n://bucket/path/part1=whatever/"). I've also tried putting the key and password directly in the url - "s3n://key:pass@bucket/path" – Ben Hunter Aug 24 '15 at 19:22
  • Afraid I've not used S3 so I'm not sure what works or not. It could be that S3 is not supported yet in SparkR, we've really just seen the 1st release of it in core and you do run into issues. Have you confirmed the loading of the data from a pyspark / scala session? - If so, I'd be more tempted by the above theory. It may be worth checking the SparkR issue logs (https://issues.apache.org/jira/browse/SPARK/component/12325400/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel) or just try searching for issues relating to S3? – Andy Judson Aug 25 '15 at 15:25
  • Try defining AWS env vars in your file or when starts RStudio: I prefer to use self contained scripts (the first option) Sys.setenv(AWS_ACCESS_KEY_ID = "") Sys.setenv(AWS_SECRET_ACCESS_KEY = "") – seufagner Jan 19 '16 at 20:27
  • Using this results in two warnings -- "`parquetFile(sqlContext...)` is deprecated". Use `parquetFile(...)` instead." --and-- "`f' is deprecated. Use `read.parquet` instead.". Unfortunately _none_ of `parquetFile` or `read.parquet` are documented, so it's not clear the proper syntax for implementing this now – MichaelChirico Dec 01 '17 at 10:25
16

As an alternative to SparkR, you could now use sparklyr:

# install.packages("sparklyr")
library(sparklyr)

sc <- spark_connect(master = "local")

spark_tbl_handle <- spark_read_parquet(sc, "tbl_name_in_spark", "/path/to/parquetdir")

regular_df <- collect(spark_tbl_handle)

spark_disconnect(sc)
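Note that, before calling spark_disconnect(), you can also push dplyr verbs down to Spark so that only the reduced result is pulled into R. A small sketch, assuming a hypothetical column some_column exists in the table:

library(dplyr)

# the filter runs in Spark; only matching rows come back to R
small_df <- spark_tbl_handle %>%
  filter(some_column > 100) %>%
  collect()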
– Aurèle
6

With reticulate you can use pandas from Python to read Parquet files. This can save you the hassle of running a Spark instance.

library(reticulate)
library(dplyr)
pandas <- import("pandas")
read_parquet <- function(path, columns = NULL) {

  # expand ~ and resolve the path to an absolute location
  path <- path.expand(path)
  path <- normalizePath(path)

  # pandas expects a Python list of column names (or None)
  if (!is.null(columns)) columns <- as.list(columns)

  # read with pandas, then convert the result to an R data frame
  xdf <- pandas$read_parquet(path, columns = columns)
  xdf <- as.data.frame(xdf, stringsAsFactors = FALSE)

  # return as a tibble for nicer printing
  dplyr::tbl_df(xdf)
}

read_parquet(PATH_TO_PARQUET_FILE)
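For example, to read only a subset of columns (the path and column names below are placeholders):

read_parquet("~/data/somefile.parquet", columns = c("x", "y"))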
– Jonathan
  • I feel like this should not be selected as the answer given the native R approaches below. – Kermit Jun 30 '19 at 00:19
  • honestly imo: the answers below require spark, which isn't native R. CMIIW – Jonathan Jul 21 '19 at 16:36
  • IMO, this is a valid answer, and might even be the "best" answer _sometimes_. However, in most situations, one of the other solutions will be "better". – Nzbuu Oct 19 '20 at 15:20
4

Spark has been updated since then, and a number of functions have been deprecated or renamed.

Andy's answer above works for Spark v1.4, but for Spark v2.3 this is the updated version that worked for me.

  1. Download the latest version of Apache Spark from https://spark.apache.org/downloads.html (point 3 on that page).

  2. Extract the .tgz file.

  3. Install the devtools package in RStudio:

    install.packages('devtools')
    
  4. Open a terminal and follow these steps:

    # This is the folder of extracted spark `.tgz` of point 1 above
    export SPARK_HOME=extracted-spark-folder-path 
    cd $SPARK_HOME/R/lib/SparkR/
    R -e "devtools::install('.')"
    
  5. Go back to RStudio:

    # load the SparkR package
    library(SparkR)
    
    # initialize sparkSession which starts a new Spark session
    sc <- sparkR.session(master="local")
    
    # load parquet file into a Spark data frame and coerce into R data frame
    df <- collect(read.parquet('.parquet-file-path'))
    
    # terminate Spark session
    sparkR.stop()
    
– Zmnako Awrahman
3

miniparquet is a new dedicated package. Install with:

devtools::install_github("hannesmuehleisen/miniparquet")

Example taken from the documentation:

library(miniparquet)

f <- system.file("extdata/userdata1.parquet", package="miniparquet")
df <- parquet_read(f)
str(df)

# 'data.frame': 1000 obs. of  13 variables:
#  $ registration_dttm: POSIXct, format: "2016-02-03 07:55:29" "2016-02-03 17:04:03" "2016-02-03 01:09:31" ...
#  $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
#  $ first_name       : chr  "Amanda" "Albert" "Evelyn" "Denise" ...
#  $ last_name        : chr  "Jordan" "Freeman" "Morgan" "Riley" ...
#  $ email            : chr  "ajordan0@com.com" "afreeman1@is.gd" "emorgan2@altervista.org" "driley3@gmpg.org" ...
#  $ gender           : chr  "Female" "Male" "Female" "Female" ...
#  $ ip_address       : chr  "1.197.201.2" "218.111.175.34" "7.161.136.94" "140.35.109.83" ...
#  $ cc               : chr  "6759521864920116" "" "6767119071901597" "3576031598965625" ...
#  $ country          : chr  "Indonesia" "Canada" "Russia" "China" ...
#  $ birthdate        : chr  "3/8/1971" "1/16/1968" "2/1/1960" "4/8/1997" ...
#  $ salary           : num  49757 150280 144973 90263 NA ...
#  $ title            : chr  "Internal Auditor" "Accountant IV" "Structural Engineer" "Senior Cost Accountant" ...
#  $ comments         : chr  "1E+02" "" "" "" ...
– Aurèle
1

For reading a parquet file in an Amazon S3 bucket, try using s3a instead of s3n. That worked for me when reading parquet files using EMR 1.4.0, RStudio and Spark 1.5.0.
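A minimal sketch of what that looks like, reusing the sqlContext sq from Andy's answer above (the bucket and path are placeholders):

# note s3a:// rather than s3n://
df <- collect(parquetFile(sq, "s3a://my-bucket/path/to/parquet"))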

1

If you have a Parquet dataset split across multiple part files, you might need to do something like this:

data.table::rbindlist(lapply(Sys.glob("path_to_parquet/part-*.parquet"), arrow::read_parquet))
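With newer versions of arrow, an alternative sketch is to open the whole directory as a dataset and collect it, which handles the multiple part files for you (the directory path is a placeholder):

library(arrow)
library(dplyr)

# open the directory of part files as one dataset, then pull it into R
open_dataset("path_to_parquet") %>% collect()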
– John Waller
1

Recently I published an R package to read Parquet and Delta files. It basically uses the arrow package under the hood, but it also handles Delta files, both locally and in the cloud.

You can use it like this:

 readparquetR(pathtoread="C:/users/...",format="delta")# format can be parquet or delta

If you want to read directly from Azure, this should work:

 readparquetR(pathtoread="blobpath/subdirectory/",    filelocation = "azure",    format="delta",    containerconnection = your_connection) 

Feel free to use it or contribute: https://github.com/mkparkin/Rinvent

– korayp