
I have a Docker container that is set up to run an R script weekly via an Airflow DAG. The DAG has 3 steps. The first, upstream of the Docker code, takes data from several databases, computes various features, and then writes the data to S3. The R script in the container reads that data from an S3 bucket, formats the data frame, runs a model to score records, and then writes the results back to S3. Finally, downstream code formats the output so that it can be loaded into Salesforce. The script worked when I wrote, built, and tested it in December. Recently the run has failed several times with this error:


    Error in as.character(x) : 
      cannot coerce type 'closure' to vector of type 'character'
    Calls: %>% ... mutate_impl -> ymd -> .parse_xxx -> unlist -> lapply -> FUN
    Execution halted

OK, so that seems to mean that the date, which is read in as a character string, cannot be parsed as a date. Since `ymd` is in the call chain, I believe the culprit is the lubridate function in the R script below.
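For context, a "closure" in R is a function, so the message says that something tried to convert a function, not a string, to character. A minimal sketch that reproduces the same message in base R (the function `f` is made up for illustration and is not from the failing script):

```r
# A closure is just an R function; f is a hypothetical stand-in.
f <- function(x) x

# Converting a function to character fails; capture the message
# instead of halting the script.
msg <- tryCatch(
  as.character(f),
  error = function(e) conditionMessage(e)
)

print(msg)
```

One possibility worth checking, stated as an assumption: lubridate also exports a function named `period`, so if the `period` column were missing from the data, the name could resolve to that function instead of a column.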

The Dockerfile (below) uses an R image that includes the tidyverse because my code uses dplyr and lubridate. I could likely get by without lubridate and use a base function to format the date, but more on that below.

Dockerfile:


    FROM rocker/tidyverse

    RUN mkdir -p /model

    RUN apt-get update -qq && apt-get install -y \
          libssl-dev \
          libcurl4-gnutls-dev 

    RUN R -e "install.packages('caret')"

    RUN R -e "install.packages('randomForest')"

    RUN R -e "install.packages('lubridate')"

    RUN R -e "install.packages('aws.s3')"

    EXPOSE 80

    EXPOSE 8787

    COPY / /

    ENTRYPOINT ["Rscript", "account_health_scoring.R"]

R script: I have to exclude the first few lines because they contain identifying info and credentials, but the code first just reads my S3 credentials from a file. Then this code block runs and fails. There is a good deal of code downstream, but it all works in the container:

    require("dplyr")
    require("caret")
    require("aws.s3")
    require("randomForest")
    require("lubridate")

    #set credentials
    Sys.setenv("AWS_ACCESS_KEY_ID" = "key",
               "AWS_SECRET_ACCESS_KEY" = "key")

    #read in model file
    s3load("rf_gridsearch.RData", bucket = "account-model")

#read in data
    data<-read.csv(text = rawToChar(get_object((paste0("account_health_data_",
                                                       gsub("-", "_", as.character(Sys.Date()),
                                                            fixed=TRUE),".csv")),
                                               bucket = "account-health-model-input")),
      stringsAsFactors = FALSE)%>%
      mutate(period=ymd(period))%>%
      mutate_if(is.integer,as.numeric)

The reason for the 2 mutate lines is that, although the column is stored upstream as a POSIX timestamp, R reads the date in as a string AND reads the float columns in as integers. Perhaps I am missing something in my read.csv call, or there is a better function for reading the data properly, but this is what I have always used.
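One option that might remove the need for both mutate lines is to declare the column types up front with the `colClasses` argument of `read.csv`. A minimal sketch, assuming the file has an ISO-formatted `period` column and a numeric column (the `score` column and the sample rows are made up for illustration):

```r
# Hypothetical CSV text standing in for the object read from S3.
csv_text <- "period,score\n2020-01-27,3\n2020-02-03,5\n"

# Named colClasses entries apply to the matching columns, so period
# is parsed as a Date and score is kept as a double, not an integer.
data <- read.csv(
  text = csv_text,
  colClasses = c(period = "Date", score = "numeric"),
  stringsAsFactors = FALSE
)

str(data)
```

This relies on the dates being in the default `YYYY-MM-DD` form; other formats would still need an explicit conversion after reading.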

Questions:

  1. What is the error message referring to, and am I correct to think the `ymd` function is the culprit?

  2. If so, how can I rewrite my code, potentially using base functions, to accomplish the same goal without relying on a package?

  3. Could it be package dependencies? From reviewing the logs this doesn't seem to be the case, as lubridate imports and uses several base functions. The package has not been updated since I wrote and tested this code.
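To make question 2 concrete, here is a minimal base-R sketch of the conversion I have in mind, assuming `period` holds strings such as "2020-01-27" (the sample values are made up for illustration):

```r
# Hypothetical sample of the period column as read from the CSV.
period_raw <- c("2020-01-27", "2020-02-03")

# Base-R equivalent of ymd() for year-month-day strings.
period_date <- as.Date(period_raw, format = "%Y-%m-%d")

print(class(period_date))
print(period_date)
```

Inside the pipeline this would presumably become `mutate(period = as.Date(period, format = "%Y-%m-%d"))`, which still depends on dplyr but not on lubridate.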

KWalker
  • Require automatically loads them. The log confirms the package is loaded and attached. I will edit to show this. – KWalker Jan 28 '20 at 17:59
  • Try running only the first part of the code, up to the creation of 'data': `data<-read.csv(text = rawToChar(get_object((paste0("account_health_data_", gsub("-", "_", as.character(Sys.Date()), fixed=TRUE),".csv")), bucket = "account-health-model-input")), stringsAsFactors = FALSE)` and then check `str(data)` – akrun Jan 28 '20 at 18:02
  • Here `ymd` implies that the column is in year-month-day format; if it is not in that format, it needs to be changed – akrun Jan 28 '20 at 18:03
  • The column is in that format. The date is stored as a POSIX timestamp and the data cleaning script that runs upstream strips the time stamp off of it so that it is in the right format. – KWalker Jan 28 '20 at 18:13

1 Answer


Well, the answer seems to be simple although I do not understand it. I changed


    require(lubridate)

to


    library(lubridate)

And it builds. I found the post "What is the difference between require() and library()?", decided to just try changing it, rebuilt the container, and it worked. I'm still trying to understand why.
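The difference that post describes can be sketched as follows. This does not by itself explain the failure above; it only shows that `require()` returns `FALSE` with a warning when a package cannot be loaded, while `library()` stops with an error. The package name is made up and assumed not to be installed:

```r
# Hypothetical package name, assumed not to be installed.
pkg <- "notARealPackage12345"

# require() returns FALSE (with a warning) instead of stopping.
ok <- suppressWarnings(require(pkg, character.only = TRUE))

# library() raises an error for the same missing package.
lib_result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

print(ok)
print(lib_result)
```

With `require()`, a script keeps running after a failed load and only breaks later at the first call that needs the package, so `library()` tends to surface the problem at the point where it actually occurs.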

KWalker