According to the article https://diskframe.com/articles/ingesting-data.html a good use case for inmapfn as part of csv_to_disk_frame(...)
is for date conversion. In my data I know the name of the date column at runtime and would like to feed in the date to a convert at read in time function. One issue I am having is that it doesn't seem any additional parameters can be passed into the inmapfn argument beyond the chunk itself. I can't use a hardcoded variable at runtime as the name of the column isn't known until runtime.
To clarify the issue is that the inmapfn seems to run in its own environment to prevent any data races/other parallelisation issues but I know the variable won't be changed so I am hoping there is someway to override this as I can make sure that this is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
library(tidyverse)
library(disk.frame)
setup_disk.frame()
a <- tribble(~dates, ~val,
"09feb2021", 2,
"21feb2012", 2,
"09mar2013", 3,
"20apr2021", 4,
)
write_csv(a, "a.csv")
dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
"a.csv",
outdir = file.path(tempdir(), "tmp.df"),
in_chunk_size = 1L,
inmapfn = function(chunk) {
chunk[, sdate := as.Date(do.call(`$`, list(chunk,dates_col)), "%d%b%Y")]
}
)
#> -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#> -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#> -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#> -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = ` to set column types to minimize the chance of a failed read
#> =================================================
#>
#> -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#>
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#>
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found