Adding part of filename into column of dataframe - with multiple input-files

Question

I am trying to extract some data from given ".dta" files and add a new column to each resulting data frame containing part of the filename (a year). I need this because I want to do a cohort analysis: the year column tracks the origin of each observation, and I will later manipulate it to determine the cohort of each individual. For ".csv" files I have code that more or less works, but I expect to run into some issues with the actual data. My code is as follows:

library(data.table) #-> for fread
library(readr) #-> for read_csv2
library(purrr) #-> for map_dfr
library(dplyr) #-> for pipe operator, tibble, mutate
library(stringr) #-> for str_sub

### Get desired filenames structure (e.g. 4 digits for year)
filenames <- list.files("~/R/Data", full.names = TRUE, pattern = "*.csv")
sites <- str_sub(filenames, start = -5, end = -5) #just for experimental purpose -5 to -5
### Get length of each file
file_lengths <- unlist(lapply(lapply(filenames, read_csv2), nrow))
### Repeat sites using lengths
file_names <- rep(sites,file_lengths)
###actual file-reading
map_df_fread <- function(path,
                         pattern = "*.csv",
                         sep = ";",
                         dec = ",",
                         colClasses = NULL,
                         select = NULL) {
    list.files(path, pattern, full.names = TRUE) %>% 
    map_dfr(~fread(., sep = sep, dec = dec, stringsAsFactors = F,
                   header = T, colClasses = colClasses,
                   select = select)) %>%
    tibble() %>% mutate(year = file_names)
}

It does what it is supposed to do, at least on small datasets. The actual data has more than 10 million observations per variable. Since I want to handle "*.dta" files, I guess I could substitute read_csv2() with read_dta(), but my concern is that in this step R would read the complete data just to count the rows, which I guess will take an extraordinary amount of time. Is there any way to include the first step in my file-reading function (which will have to do this 20 times)? I would really like to limit the amount of memory needed for all those computations.

Any help would be appreciated!

Thanks in advance

Donald Seinen · Accepted Answer · 2021-11-16T12:33:38.783

Given df <- data.frame(a = 1:5, b = c(T,F,T,F, F)), the assignment df$nms <- "filename" works because the single value gets recycled to every row. Combined with purrr's imap function, which passes each element's index (or name) along with the element, we can make a column of (manipulated) file names directly instead of reading the files twice. To make this viable for .dta files, simply substitute the relevant I/O functions and the pattern in list.files or the function arguments.
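The recycling behaviour can be checked in isolation with base R. This is only an illustrative sketch on a toy data frame; the column name nms is just a placeholder.

```r
# Toy data frame with five rows
df <- data.frame(a = 1:5, b = c(TRUE, FALSE, TRUE, FALSE, FALSE))

# A length-one value is recycled to every row of the new column
df$nms <- "filename"

nrow(df)        # number of rows in the data frame
unique(df$nms)  # the distinct values in the new column
```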

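# Not run: create two small example files in the working directory
write.csv(data.frame(a = 1:5, b = c(TRUE, FALSE, TRUE, FALSE, FALSE)), file = "t1.csv")
write.csv(data.frame(a = 1:3, b = c(FALSE, FALSE, FALSE)), file = "t2.csv")

f <- function(path = NULL){
  if(is.null(path)) path <- getwd()
  # full.names = TRUE so the files are found even if path is not the working directory
  fls <- list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
  # second character of the bare file name, e.g. "t1.csv" -> "1"
  string <- substring(basename(fls), 2, 2)
  # read each file once; imap supplies the list index as .y
  lapply(fls, data.table::fread) |>
    purrr::imap(~.x |> transform(year = string[.y]))
}

f()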

[[1]]
   V1 a     b year
1:  1 1  TRUE    1
2:  2 2 FALSE    1
3:  3 3  TRUE    1
4:  4 4 FALSE    1
5:  5 5 FALSE    1

[[2]]
   V1 a     b year
1:  1 1 FALSE    2
2:  2 2 FALSE    2
3:  3 3 FALSE    2

Or, if you want a single data.frame:

do.call(rbind, f())
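For the .dta case raised in the question, a minimal sketch of the same idea might look as follows. It assumes the haven package for Stata I/O, hypothetical file names such as survey_2001.dta, and that the year is the four digits directly before the extension; none of these details come from the original post, and read_dta_with_year is an illustrative helper name.

```r
library(haven)  # read_dta(), write_dta()

# Hypothetical example files: two tiny Stata files in a temporary directory
dta_dir <- file.path(tempdir(), "dta_demo")
dir.create(dta_dir, showWarnings = FALSE)
write_dta(data.frame(id = 1:3, income = c(10, 20, 30)),
          file.path(dta_dir, "survey_2001.dta"))
write_dta(data.frame(id = 1:2, income = c(15, 25)),
          file.path(dta_dir, "survey_2002.dta"))

read_dta_with_year <- function(path) {
  files <- list.files(path, pattern = "\\.dta$", full.names = TRUE)
  # Assumption: the year is the four digits right before ".dta"
  years <- as.integer(sub("^.*([0-9]{4})\\.dta$", "\\1", basename(files)))
  # Each file is read exactly once; the single year value is recycled over its rows
  tables <- Map(function(file, year) {
    dat <- as.data.frame(read_dta(file))
    dat$year <- year
    dat
  }, files, years)
  # Stack the per-file tables into one data frame
  do.call(rbind, unname(tables))
}

combined <- read_dta_with_year(dta_dir)
combined
```

If only some columns or rows are needed, read_dta also accepts col_select and n_max arguments, which may help keep memory use down.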

edited Nov 16 '21 at 12:33

answered Nov 16 '21 at 12:07


Thank you very much for your answer. Just out of curiosity: what does |> mean, and where does it come from? I can't find a help file for this (I guess) operator, and R just tells me that this line would cause an error. Could you please help me with that one? Thank you. – BossVom Schloss Nov 16 '21 at 14:50
@BossVomSchloss `|>` is a pipe "operator" native to R as of version 4.1.0. See `?pipeOp` for its documentation. Its use is very similar to the `magrittr` package's `%>%`, but it differs in that it is handled by the parser, which rewrites the expression into an ordinary nested function call before it is evaluated. Much like punctuation in English. To see that in action, try `deparse(substitute(x |> sum() |> exp()))` and compare it with `deparse(substitute(x %>% sum() %>% exp()))`. It's a relatively new addition, so the error indication will disappear when the RStudio developers release an update. – Donald Seinen Nov 16 '21 at 15:10
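The parser behaviour described in the comment above can be inspected without loading any package, because quote() only captures the parsed expression and does not evaluate it. A small sketch, assuming R 4.1.0 or later; the variable names are illustrative:

```r
# The native pipe is rewritten by the parser into a nested call
native <- deparse(quote(x |> sum() |> exp()))
native

# %>% is an ordinary infix function, so the call is kept as written
magrittr_style <- deparse(quote(x %>% sum() %>% exp()))
magrittr_style
```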
So, I passed my working directory to your function and unfortunately I received the following error: Error in fread(): Input is empty or only contains BOM or terminal control characters Called from: fread() Since it worked with read.csv (and still works), the .csv files are definitely not empty. The test files just contain 15 variables each with some random numbers and between 10k and 150k observations. What am I missing here? – BossVom Schloss Nov 16 '21 at 15:11
Have you passed the same arguments to the `fread` function as you did in your original approach? That is, `lapply(fls, data.table::fread)` currently doesn't use any additional arguments. You can add them `lapply(fls, data.table::fread, sep = ";", dec = ",")` et cetera. I don't have any .csv files that follow such structure, so the specific arguments to pass I couldn't say – Donald Seinen Nov 16 '21 at 15:13
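Forwarding reader arguments through lapply, as suggested in the comment above, could be sketched like this. The temporary file and its columns are made up for illustration; only the sep and dec values come from the thread.

```r
library(data.table)  # fread()

# Hypothetical example file: semicolon-separated, with decimal commas
csv_file <- tempfile(fileext = ".csv")
write.csv2(data.frame(id = 1:3, price = c(1.5, 2.25, 3)),
           csv_file, row.names = FALSE)

# Arguments after the function name are passed on to every fread() call
tables <- lapply(list(csv_file), fread, sep = ";", dec = ",")

# Inspect the column types of the first table
str(tables[[1]])
```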
No, I just left everything blank since fread should automatically see ";" as separator and "," as decimal point, as I've understood the help file. I also didn't specify which columns to import, just to see if fread will load the whole data. – BossVom Schloss Nov 16 '21 at 15:15
I found that I passed fread() instead of fread into the lapply function. After correcting that, there is no error, but the result is a list containing 0 elements. :/ This is my code:
```
f <- function(path, pattern = ".csv", sep = ";", dec = ",", colClasses = NULL, select = NULL){
  if(is.null(path)) path <- getwd()
  fls <- as.list(list.files(path = path, pattern = ".csv"))
  string <- lapply(fls, substring, 2, 6)
  lapply(fls, fread, sep = sep, dec = dec, colClasses = colClasses, select = select) |>
    imap(~.x |> transform(year = string[.y]))
}
```
– BossVom Schloss Nov 16 '21 at 15:18
@BossVomSchloss When you run it with my sample csv and `data.table::fread`, does it run properly, or do you get the same error? If those test files run fine, then it might be a more difficult problem to solve, having just looked at https://stackoverflow.com/questions/39593637/dealing-with-byte-order-mark-bom-in-r . I am not familiar with this error or issue; if it persists, it might be wise to open a separate topic on it. – Donald Seinen Nov 16 '21 at 15:28
Your data works perfectly fine, also from the other directory. I will read into the discussion you suggested above. Thank you very much for your help! Best regards :) – BossVom Schloss Nov 16 '21 at 15:33
