I realise parquet is a column format, but with large files, sometimes you don't want to read it all into memory in R before filtering, and the first 1000 or so rows may be enough for testing. I don't see an option for this in the `read_parquet()` documentation here.

I see a solution for pandas here, and an option for C# here, but it's not obvious to me how either of those might translate to R. Suggestions?

Mark Neal
  • Looking through the docs, it seems like arrow gives lazy evaluation. So maybe you can `dplyr::slice_head(n=1000) %>% compute()`? – Dan Adams Jul 27 '22 at 02:19
  • Unfortunately `arrow::read_parquet()` does not appear to use lazy evaluation, based on my testing of the time and peak memory used to a) read the whole file versus b) a piped implementation of `slice()` as you proposed; both deliver similar results. – Mark Neal Jul 27 '22 at 03:02
  • I think if you use `arrow::open_dataset()` that will index the parquet dataset and set it up for lazy evaluation. More here: https://arrow.apache.org/docs/r/articles/dataset.html – Jon Spring Jul 27 '22 at 03:04
  • @Jon is correct, `arrow::open_dataset()` appears to allow lazy evaluation. The lazy object is not compatible with `slice()`, but `head()` or `filter()` works. A good result - thanks! – Mark Neal Jul 27 '22 at 03:44

3 Answers


Thanks to Jon and Dan for pointing in the right direction.

`arrow::open_dataset()` allows lazy evaluation (docs [here][1]); you can then take the `head()` of the lazy object (but not `slice()` it), or `filter()` it, before materialising the result. This approach is faster and uses much less peak RAM. Example below.

# https://stackoverflow.com/questions/73131505/r-reading-first-n-rows-from-parquet-file

library(dplyr)
library(arrow)
library(tictoc) #optional, used to time results

tic("read all of large parquet file")
my_animals <- read_parquet("data/my_animals.parquet")
toc() # slow and uses heaps of ram

tic("read parquet and write mini version")
my_animals <- open_dataset("data/my_animals.parquet") 
my_animals # this is a lazy object

my_animals %>% 
  # slice(1000L) %>% # doesn't work on the lazy object
  head(n = 1000L) %>% 
  # filter(YEAROFBIRTH >= 2010) %>% # also works
  compute() %>% 
  write_parquet("data/my_animals_mini.parquet") # optional
toc() # much faster, much less peak ram used
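
If you only need those first rows in R for quick testing, rather than writing a mini parquet file, you can also `collect()` the lazy query straight into a tibble. A minimal sketch along the same lines, using the same data/my_animals.parquet file as above:

library(dplyr)
library(arrow)

# open_dataset() only reads the file's metadata, so nothing is pulled into RAM yet
first_rows <- open_dataset("data/my_animals.parquet") %>% 
  head(n = 1000L) %>% # restrict the scan to the first 1000 rows
  collect()           # materialise just those rows as a tibble in R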


  [1]: https://arrow.apache.org/docs/r/articles/dataset.html
Mark Neal

You can use the `as_data_frame = FALSE` argument of `read_parquet()` to return the data as an Arrow Table object. You can then use {dplyr} verbs on this object, followed by `dplyr::collect()` (`collect()` returns a tibble, whereas `compute()` merely forces the computation and keeps the result as an Arrow object).

library(dplyr)
library(arrow)

my_animals <- read_parquet("data/my_animals.parquet", as_data_frame = FALSE) |>
  slice_head(n = 1000) |>
  collect()

This is readable, fast and memory efficient!

See https://arrow.apache.org/docs/r/articles/data_wrangling.html for more info.
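
As a side note on the `compute()` versus `collect()` distinction above: both force evaluation of the lazy {dplyr} query, but they hand back different object types. A minimal sketch, assuming the same data/my_animals.parquet file and YEAROFBIRTH column used in the earlier answer:

library(dplyr)
library(arrow)

# as_data_frame = FALSE keeps the data as an Arrow Table rather than a tibble
tbl_arrow <- read_parquet("data/my_animals.parquet", as_data_frame = FALSE)

# dplyr verbs on an Arrow Table build up a lazy query
query <- tbl_arrow |>
  filter(YEAROFBIRTH >= 2010)

compute(query) # evaluates the query; the result stays as an Arrow Table
collect(query) # evaluates the query and returns a tibble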

Moohan

I published this simple package for practical use: https://github.com/mkparkin/Rinvent. Feel free to check whether it can help. There is a parameter called "sample" which returns a sample of rows, and it can also read "delta" files.

readparquetR(pathtoread = "C:/users/...", format = "delta", sample = 10) or readparquetR(pathtoread = "C:/users/...", format = "parquet", sample = 10)

korayp