I have been testing the apache-arrow R package to fetch data (parquet files) from S3 for some shiny apps, with some success. Everything works as expected during local development, but after deploying to shinyproxy on an EC2 server the app works normally only at first: if no new data is pulled from the S3 bucket for ~1 minute, the app will crash (or error with a warning) the next time it attempts to fetch data from S3. A simple shiny app that reproduces the problem is posted below, though reproducing it requires a private S3 bucket and AWS credentials.
The problem does not occur when: 1. running locally directly from RStudio, 2. running the dockerized app locally, 3. running the dockerized app via a local shinyproxy install, 4. running directly on RStudio Server on an EC2 instance. It occurs only when using shinyproxy on EC2, and it does not matter whether the AWS credentials are hard-coded into the app or the server has an IAM role with the necessary permissions. However, if the files in the S3 bucket are made public, the app works just fine. So the issue appears to be related to how arrow handles AWS credentials, but only under some circumstances. EDIT: It appears that 'Method 2' in the app below works if the S3 bucket/object is public.
The server I am using is currently running Ubuntu 22.04 and shinyproxy 2.6.1, and it runs multiple other shiny apps that access AWS resources using other methods (e.g., pulling from DynamoDB) with no issues.
Here is a minimal shiny app that can recreate the problem in 2 different ways:
library(shiny)
library(dplyr)
library(arrow)
library(ggplot2)
ui <- fluidPage(
  h1("Method 1:"),
  selectInput(
    "species1", "Species",
    choices = c("setosa", "versicolor", "virginica"),
    selected = c("setosa", "versicolor", "virginica"),
    multiple = TRUE
  ),
  plotOutput("plot1"),
  h1("Method 2:"),
  selectInput(
    "species2", "Species",
    choices = c("setosa", "versicolor", "virginica"),
    selected = c("setosa", "versicolor", "virginica"),
    multiple = TRUE
  ),
  plotOutput("plot2")
)
server <- function(input, output) {
  # Writing the Iris dataset to a private S3 bucket
  #URI <- "s3://----YOURBUCKET----/iris-data"
  #write_dataset(iris, URI)
  # Hard code AWS credentials (for testing)
  bucket <- arrow::s3_bucket("----YOURBUCKET----",
                             access_key = "----",
                             secret_key = "-----")
  # Method 1: open the dataset on every fetch
  dat1 <- reactive({
    arrow::open_dataset(bucket$path("iris-data")) |>
      filter(Species %in% input$species1) |>
      collect()
  })
  output$plot1 <- renderPlot({
    ggplot(dat1(), aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
      geom_point(size = 3) +
      theme_bw()
  })
  # Method 2: open the dataset once and reuse the handle
  con <- reactive({ arrow::open_dataset(bucket$path("iris-data")) })
  dat2 <- reactive({
    con() |>
      filter(Species %in% input$species2) |>
      collect()
  })
  output$plot2 <- renderPlot({
    ggplot(dat2(), aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
      geom_point(size = 3) +
      theme_bw()
  })
}
Using Method 1, where open_dataset() is called every time, fetching data after idling for a few minutes causes a failure without totally crashing the app. After the error, the log just shows:
Warning: Error in fs___FileSystem__GetTargetInfos_Paths: ignoring SIGPIPE signal
192: fs___FileSystem__GetTargetInfos_Paths
191: path_and_fs$fs$GetFileInfo
190: DatasetFactory$create
189: arrow::open_dataset
186: <reactive:dat1> [/app/app.R#51]
184: .func
181: contextFunc
180: env$runWith
173: ctx$run
172: self$.updateValue
170: dat1
168: renderPlot [/app/app.R#57]
166: func
126: drawPlot
112: <reactive:plotObj>
96: drawReactive
83: renderFunc
82: output$plot1
1: shiny::runApp
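Because the Method 1 failure surfaces as an ordinary R error rather than a crash, one possible stopgap is to retry the fetch. The sketch below assumes the failure is transient (for example, a connection that went stale while idle), which the trace suggests but does not prove. `with_retry()` is a hypothetical helper, not part of shiny or arrow, and the demo uses a simulated failure instead of a real S3 call.

```r
# Hypothetical helper: call `fetch()` and, if it signals an error, wait
# briefly and try again, up to `attempts` calls in total. If every attempt
# fails, the last error is re-signalled.
with_retry <- function(fetch, attempts = 3, wait_secs = 0.5) {
  stopifnot(is.function(fetch), attempts >= 1)
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(
      list(ok = TRUE, value = fetch()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (result$ok) return(result$value)
    last_error <- result$error
    if (i < attempts) Sys.sleep(wait_secs)
  }
  stop(last_error)
}

# Demo with a simulated failure: the first call errors, the second succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 2) stop("simulated stale connection")
  "data"
}
with_retry(flaky, wait_secs = 0)  # "data", after one simulated failure

# In the app, Method 1 would then look something like this (untested):
# dat1 <- reactive({
#   with_retry(function() {
#     arrow::open_dataset(bucket$path("iris-data")) |>
#       filter(Species %in% input$species1) |>
#       collect()
#   })
# })
```

A retry only masks the symptom, and it cannot help with Method 2, since a segfault kills the R process before any error handler runs.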
Using Method 2, where open_dataset() is only run once, the app crashes completely when trying to fetch data after an idle period. The log shows:
Warning: stack imbalance in '[[<-', 522 then 531
Warning: stack imbalance in '<-', 513 then 524
Warning: stack imbalance in '$<-', 519 then 520
Warning: stack imbalance in '<-', 511 then 518
Warning: stack imbalance in '$<-', 511 then 508
Warning: stack imbalance in '<-', 502 then 500
Warning: stack imbalance in '$<-', 447 then 442
Warning: stack imbalance in '<-', 437 then 434
*** caught segfault ***
address 0x76b8, cause 'memory not mapped'
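Since a segfault cannot be caught from R, one way to experiment with Method 2 is to avoid reusing an old handle at all, rebuilding it once it passes a certain age. This is only a sketch under the assumption that something goes stale during the idle period. `fresh_handle()` and its `max_age_secs` and `now` parameters are hypothetical names. Note that Method 1 already re-opens the dataset on every call and still fails, so the filesystem object returned by `s3_bucket()` would presumably need to be rebuilt as well, and even that may not be sufficient.

```r
# Hypothetical helper: returns a function that hands back a cached object and
# rebuilds it with `open_fn()` once it is older than `max_age_secs`.
# `now` is injectable so the behaviour can be checked without waiting.
fresh_handle <- function(open_fn, max_age_secs = 30, now = Sys.time) {
  handle <- NULL
  opened_at <- NULL
  function() {
    stale <- is.null(opened_at) ||
      as.numeric(difftime(now(), opened_at, units = "secs")) > max_age_secs
    if (stale) {
      handle <<- open_fn()
      opened_at <<- now()
    }
    handle
  }
}

# Demo with a stand-in opener that just counts how often it is called.
opens <- 0
get_ds <- fresh_handle(function() {
  opens <<- opens + 1
  paste("handle", opens)
}, max_age_secs = 30)
get_ds()  # opens the handle
get_ds()  # reuses it while it is younger than 30 seconds

# In the app, the opener would rebuild both the bucket and the dataset
# (untested; placeholders as in the app above):
# get_ds <- fresh_handle(function() {
#   bucket <- arrow::s3_bucket("----YOURBUCKET----",
#                              access_key = "----", secret_key = "-----")
#   arrow::open_dataset(bucket$path("iris-data"))
# })
```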
For reproducibility, here is the Dockerfile:
FROM rocker/verse:4.1.2
RUN apt-get update -y && apt-get install -y libcurl4-openssl-dev libssl-dev gcc make zlib1g-dev git libgit2-dev libfontconfig1-dev libfreetype6-dev libpng-dev libicu-dev libxml2-dev pandoc && rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get clean
COPY . ./app
RUN Rscript -e 'install.packages("shiny")'
RUN Rscript -e "Sys.setenv(ARROW_S3='ON'); install.packages('arrow')"
EXPOSE 3838
CMD ["R", "-e", "shiny::runApp('/app', host = '0.0.0.0', port = 3838)"]
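When rebuilding the image, it can be worth confirming inside the container that the installed arrow build actually has S3 support, since that depends on how the package was compiled. A small check, using arrow's `arrow_with_s3()` and a hypothetical wrapper `arrow_s3_status()` that returns `NA` when arrow is not installed at all:

```r
# Report whether the installed arrow build was compiled with S3 support.
# Returns NA if the arrow package is not installed.
arrow_s3_status <- function() {
  if (!requireNamespace("arrow", quietly = TRUE)) return(NA)
  arrow::arrow_with_s3()
}

status <- arrow_s3_status()
message("arrow S3 support: ", status)
```

This only rules out a build problem; the app fetching data successfully before the idle period already suggests S3 support is enabled in this image.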