
How do I read a partitioned Parquet file into R with arrow (without any Spark)?

The situation

  1. Parquet files are created with a Spark pipeline and saved on S3
  2. They are read with RStudio/RShiny, using one column as an index, for further analysis

The parquet file structure

The parquet file created by my Spark pipeline consists of several parts:

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet")

but this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

It works if I read just one file from the directory:

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

but I need to load all parts in order to query the full dataset.

What I found in the documentation

In the Apache Arrow documentation (https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html) I found that there are some properties for the read_parquet() command, but I can't get it working and cannot find any examples.

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)

How do I set the properties correctly to read the full directory?

# presumably one of these methods:
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)

Help would be much appreciated.

Alex Ortner

6 Answers


As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm currently running 4.0.0) this is possible.

I noticed your files used snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)

Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)

The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html

Below is a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns, though I'm still trying to figure out the syntax myself.

library(arrow)

## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
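
For the filtering and column selection mentioned above, arrow also exposes dplyr verbs on datasets. A minimal sketch of one way to do it, where the column names component_id and value are placeholders for whatever your schema actually contains:

library(arrow)
library(dplyr)

## Open the directory lazily, push the column selection and row filter down to
## the parquet files, and only then materialize the result as an R data frame
DF_subset <- open_dataset("/path/to/directory") %>%
  select(component_id, value) %>%
  filter(value > 0) %>%
  collect()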
Matt Summersgill

Solution for: Read partitioned parquet files from local file system into R dataframe with arrow

Since I would like to avoid using any Spark or Python on the RShiny server, I can't use other libraries like sparklyr, SparkR, or reticulate with dplyr, as described e.g. in "How do I read a Parquet in R and convert it to an R DataFrame?"

I solved my task for now with your proposal, using arrow together with lapply and rbindlist:

my_df <-data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))
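
Since the original goal was to use one column as an index for further analysis, the result of rbindlist is already a data.table and can be keyed directly; a small sketch, where component_id is a placeholder column name:

library(data.table)

# key the combined table on the index column so subsequent lookups and joins are indexed
setkey(my_df, component_id)
# fast subset on the keyed column, e.g. all rows for one component
subset_df <- my_df[component_id == "C123"]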

Looking forward to when the Apache Arrow dataset functionality is available. Thanks!

Alex Ortner

Reading a directory of files is not something you can achieve by setting an option to the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you can also select/filter on each if you only need a known subset of the data.
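
For example, a sketch of that pattern with purrr::map_dfr, reading only a subset of columns from each file (the directory path and the column names component_id and value are placeholders):

library(arrow)
library(purrr)

# list the part files, read only the needed columns from each, and row-bind them
files <- list.files("component_mapping.parquet", pattern = "\\.parquet$", full.names = TRUE)
my_df <- map_dfr(files, ~ read_parquet(.x, col_select = c(component_id, value)))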

In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.

Neal Richardson
  • Mhh, but this is kind of the substructure of a parquet file generated by Spark. If I read it via parquet-tools I also only enter the main name and it gives me everything in one list. So the only solution would then be to either concatenate all files upfront or manually iterate over every file. But then I lose any parallelism in the read operation. – Alex Ortner Oct 17 '19 at 20:42
  • There are various ways you can parallelize file reads in R, just not currently baked into the `arrow` package. If you want it built into arrow, check back soon, it's coming. – Neal Richardson Oct 17 '19 at 21:31
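
A sketch of one such approach, using parallel::mclapply to read the part files on several cores (the path and core count are placeholders; forking does not work on Windows, where parLapply would be needed instead):

library(arrow)
library(parallel)
library(data.table)

# read the part files on 4 forked workers and bind the results into one table
files <- Sys.glob("component_mapping.parquet/part-*.parquet")
my_df <- rbindlist(mclapply(files, read_parquet, mc.cores = 4))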

Solution for: Read partitioned parquet files from S3 into R dataframe using arrow

As it took me very long to figure out a solution and I was not able to find anything on the web, I would like to share this way of reading partitioned parquet files from S3:

library(arrow)
library(aws.s3)
library(data.table)

bucket <- "mybucket"
prefix <- "my_prefix"

# use the aws.s3 library to get all "part-" file keys for one parquet folder in the bucket, for the given prefix pattern
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply aws.s3::s3read_using to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket))

# concatenate all data together into one data.frame
data <- do.call(rbind, data)

What a mess, but it works. @neal-richardson, is there a way to use arrow directly to read from S3? I couldn't find anything in the documentation for R.

Alex Ortner

I am working on this package to make this easier. https://github.com/mkparkin/Rinvent

Right now it can read parquet files or delta files from local storage, AWS S3, or Azure Blob.

# read parquet from local with where condition in the partition
readparquetR(pathtoread="C:/users/...", add_part_names=F, sample=F, where="sku=1 & store=1", partition="2022")

# read local delta files
readparquetR(pathtoread="C:/users/...", format="delta")

# read delta files from Azure Blob storage via an AzureStor container connection
your_connection = AzureStor::storage_container(AzureStor::storage_endpoint(your_link, key=your_key), "your_container")

readparquetR(pathtoread="blobpath/subdirectory/", filelocation = "azure", format="delta", containerconnection = your_connection)
korayp

Another strategy that worked for me is to process these files using the tidy approach.

This assumes you want to read all files directly from S3, without saving them locally, and concatenate them into a single data frame or tibble.

library(tidyverse)
library(arrow)

# connect to the bucket and list the parquet part files under the directory
aws_bucket <- arrow::s3_bucket('my-bucket')
files <- aws_bucket$ls('path/to/directory/')

# resolve each file name against the bucket and read/row-bind everything into one tibble
aws_keys <- map(files, aws_bucket$path)
data <- map_dfr(aws_keys, arrow::read_parquet, .progress = TRUE)
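
A possible alternative, assuming the arrow build was compiled with S3 support: open the partitioned directory as a single Dataset directly from the bucket, so files are scanned lazily rather than read one by one (bucket name and path are placeholders):

library(arrow)
library(dplyr)

# open the partitioned directory on S3 as one Dataset and pull it into memory
ds <- open_dataset("s3://my-bucket/path/to/directory/")
data <- ds %>% collect()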
DavidWS