Each month a new PDF file is added to a specific directory. I am trying to build a data pipeline in R using targets to extract some information from these files.

list(
  tar_target(
    "data_path",
    list.files(path = "dir", full.names = TRUE)
  ),
  tar_target(
    "data_pdf_raw",
    read_pdf(data_path),
    pattern = map(data_path)
  ),
  tar_target(
    "data_pdf_clean",
    clead_pdf(data_pdf_raw[[1]]),
    pattern = map(data_pdf_raw)
  ),
  tar_target(
    "data_to_sql",
    data_to_sql(data_pdf_clean)
  )
)

The problem is that targets skips data_path even though new files have been added to the directory. I have tried format = "file" on data_path without success. I have also tried adding a new target, as suggested in one of the posts below.

tar_target(paths2, list_path, format = "file", pattern = map(data_path)),

As there are quite a lot of PDFs and the process is time-consuming, I would rather not re-read all the files every single time.

I have seen these two questions, but their solutions do not work in my case.

Using R Targets to Append New Data to Exisiting Data

How should I use {targets} when I have multiple data files

Pierre

1 Answer

Using the {targets} user manual (https://books.ropensci.org/targets/dynamic.html), I worked out how to solve my issue with tar_files() from the tarchetypes package and format = "feather".

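library(targets)
library(tarchetypes)  # provides tar_files()

list(
  # Tracks each listed file individually, so newly added files are picked up
  tar_files(
    "data_path",
    list.files(path = "dir", full.names = TRUE)
  ),
  # One branch per file; only new or changed files are read again
  tar_target(
    "data_pdf_raw",
    read_pdf(data_path),
    pattern = map(data_path),
    format = "feather"
  )
)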
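
For anyone wondering why this works: as far as I understand the tarchetypes documentation, tar_files() is shorthand for a pair of targets. An upstream target (named data_path_files here) always re-runs and lists the directory, and a downstream dynamic target with format = "file" tracks each path separately. Below is a rough sketch of that expanded form. The read_pdf() body is a hypothetical stand-in for the real function from the question, and format = "feather" assumes the arrow package is installed and that read_pdf() returns a data frame.

```r
# _targets.R (sketch of roughly what tar_files() sets up)
library(targets)

# Hypothetical stand-in for the question's read_pdf(); swap in the real one.
read_pdf <- function(path) {
  data.frame(file = basename(path), size = file.size(path))
}

list(
  # Listing the directory is cheap, so force it to run on every tar_make().
  tar_target(
    data_path_files,
    list.files(path = "dir", full.names = TRUE),
    cue = tar_cue(mode = "always")
  ),
  # One branch per path; format = "file" makes targets watch each file's contents.
  tar_target(
    data_path,
    data_path_files,
    pattern = map(data_path_files),
    format = "file"
  ),
  # Branches for files that are already known and unchanged are skipped.
  tar_target(
    data_pdf_raw,
    read_pdf(data_path),
    pattern = map(data_path),
    format = "feather"
  )
)
```

With this layout, running tar_make() after a new PDF has been dropped into the directory should build only the branch for that file, while the existing branches are skipped.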
Pierre