
I'm quite new to R and a bit stuck on what I feel is likely a common operation. I have a number of files (57, with ~1.5 billion rows cumulatively across 6 columns) that I need to perform basic calculations on. I'm able to read these files in and perform the calculations no problem, but I'm tripping up on the final output. I envision the function working on one file at a time, writing out the finished file and moving on to the next.

After the calculations I would like to output 57 new .txt files, each named after the file its input data came from. So far I'm able to perform the calculations on smaller test datasets and spit out one appended .txt file, but this isn't what I want as a final output.

#list filenames 
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)

#begin looping process
loop_output <- lapply(files, function(x) {

  # Load file 'x'
  DF <- read.table(x, header = FALSE, sep = "\t")

  # Calculated height average
  R_ref <- 1647.038203

  # Add column names to the .las data
  colnames(DF) <- c("X", "Y", "Z", "I", "A", "FC")

  # Calculate return
  DF$R_calc <- (R_ref - DF$Z) / cos(DF$A * pi / 180)

  # Calculate intensity
  DF$Ir_calc <- DF$I * (DF$R_calc^2 / R_ref^2)

  # Output new .txt with calculated columns
  write.table(DF, file = , row.names = FALSE, col.names = FALSE,
              append = TRUE, fileEncoding = "UTF-8")

})

My latest code endeavors have been to mess around with the initial lapply/sapply function as so:

#begin looping process
loop_output = sapply(names(files), 
function(x) {

As well as the output line:

#Output new .txt with calculated columns 
write.table(DF, file=paste0(names(DF), "txt", sep="."),
row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")

From what I've been reading, the file-naming step in the write.table output may be one of the pieces I don't have fully aligned yet with the rest of the script. I've been looking at a lot of other questions that I felt were applicable:

Using lapply to apply a function over list of data frames and saving output to files with different names

Write list of data.frames to separate CSV files with lapply

but with no luck. I deeply appreciate any insights or paths toward the right direction for inputting x number of files, performing the same function on each, then outputting the same x number of files. Thank you.

forest_codes
  • `map()` from the `purrr` package works well for this. You can read in a folder of files, keeping them separate, and perform the same set of operations over each one. I would define a function to perform the requisite operations, and then read in, transform, then write with `map()` – Mako212 Jul 19 '17 at 16:46
  • So the issue to your `lapply` code is just the one appended text file? – Parfait Jul 19 '17 at 17:02
  • @Parfait No, it arrives to a similar conclusion as I would like: ie, it calculates what I need to calculate and provides a correct output. However, I want to output 57 individual new files instead of the 1 appended file for data size management and for what I want to do with the files in the next step of my work process. – forest_codes Jul 19 '17 at 17:08
  • Then simply adjust the *file=* argument as @Damian shows in your `write.table` and add a `return(DF)` so your `lapply` returns a list of dataframes and not results of `write.table()` – Parfait Jul 19 '17 at 20:06

2 Answers


The reason the output is directed to the same file is probably that file = paste0(names(DF), "txt", sep=".") returns the same value for every iteration. That is, DF must have the same column names in every iteration, therefore names(DF) will be the same, and paste0(names(DF), "txt", sep=".") will be the same. Along with the append = TRUE option the result is that all output is written to the same file.
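To see this concretely, here is a quick illustration (the name vector is a hypothetical stand-in): `paste0()` has no `sep` argument, so the `"."` is absorbed into `...` and pasted on as just another string, and the result spans the whole name vector:

```r
# Stand-in for names(DF) after the two calculated columns have been added
df_names <- c("X", "Y", "Z", "I", "A", "FC", "R_calc", "Ir_calc")

# paste0() has no `sep` argument, so "." is treated as one more string to paste,
# producing a length-8 character vector -- identical on every iteration:
fname <- paste0(df_names, "txt", sep = ".")
print(fname)
# "Xtxt." "Ytxt." "Ztxt." "Itxt." "Atxt." "FCtxt." "R_calctxt." "Ir_calctxt."

# write.table() then uses only the first element ("Xtxt.") as the file name,
# and with append = TRUE every iteration's output lands in that same file.
```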

Inside the anonymous function, x is the name of the input file. Instead of using names(DF) as a basis for the output file name you could do some transformation of this character string.

For example, given

x <- "/foo/raw_data.csv"

Inside the function you could do something like this

infile <- x
outfile <- file.path(dirname(infile), gsub('raw', 'clean', basename(infile)))

outfile
[1] "/foo/clean_data.csv"

Then use the new name for the output, with append = FALSE (unless you need it to be TRUE):

write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE, append = FALSE, fileEncoding = "UTF-8")
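Putting the pieces together, the whole loop from the question might look like this (a sketch only; the input directory and the `_calc` output suffix are assumptions, not the asker's actual paths):

```r
# Input directory and output suffix are placeholders -- adjust to your setup
files <- list.files(path = "input_dir", pattern = "\\.txt$", full.names = TRUE)

R_ref <- 1647.038203  # calculated height average

loop_output <- lapply(files, function(x) {
  DF <- read.table(x, header = FALSE, sep = "\t")
  colnames(DF) <- c("X", "Y", "Z", "I", "A", "FC")

  DF$R_calc  <- (R_ref - DF$Z) / cos(DF$A * pi / 180)  # return
  DF$Ir_calc <- DF$I * (DF$R_calc^2 / R_ref^2)         # intensity

  # One output name per input, derived from x:
  # /dir/plot01.txt -> /dir/plot01_calc.txt
  outfile <- sub("\\.txt$", "_calc.txt", x)
  write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE,
              append = FALSE, fileEncoding = "UTF-8")
  DF  # return the transformed data frame so loop_output holds the results too
})
```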
Damian
  • thank you for your input and suggestions. This proved to be the trick! Inside the function I put 'inFile' as the first line and then 'outFile' just before my output line as you had written (with append = FALSE in the write.table line). With sapply at the function line my code wasn't working, with lapply at the function line it did. Again, thank you. – forest_codes Jul 19 '17 at 19:41
  • Glad to help, and if the issue is resolved please accept the answer to let others know. (ref: https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) – Damian Jul 19 '17 at 19:45
  • I like the gsub('raw', 'clean'... function too. Helps with an overwrite issue I was trying to prevent. I'll accept your answer – forest_codes Jul 19 '17 at 20:26

Using your code, this is the general idea:

require(purrr)

#list filenames 
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)


#Call calculated height average a name
R_ref= 1647.038203

dfTransform <- function(file){
  colnames(file) <- c("X","Y","Z","I","A","FC")

  #Calculate return
  file$R_calc <- (R_ref - file$Z)/cos(file$A*pi/180)

  #Calculate intensity
  file$Ir_calc <- file$I * (file$R_calc^2/R_ref^2)
  return(file)
}

output <- files %>%
  map(read.table, header = FALSE, sep = "\t") %>%
  map(dfTransform) %>%
  map2(files, ~ write.table(.x, file = sub("\\.txt$", "_calc.txt", .y),
                            row.names = FALSE, col.names = FALSE,
                            fileEncoding = "UTF-8"))
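Since write.table() is called only for its side effect, purrr's walk2() is arguably a better fit for the last step: it pairs each transformed table with its source path and returns its input invisibly. A sketch, reusing `files` and `dfTransform` from the code above; the `_calc` output suffix is an assumption:

```r
library(purrr)

# One output name per input file, e.g. plot01.txt -> plot01_calc.txt
out_names <- sub("\\.txt$", "_calc.txt", files)

files %>%
  map(read.table, header = FALSE, sep = "\t") %>%
  map(dfTransform) %>%
  walk2(out_names, ~ write.table(.x, file = .y, row.names = FALSE,
                                 col.names = FALSE, fileEncoding = "UTF-8"))
```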
Mako212
  • thank you very much for your answer and introducing me to the 'purrr' package. I've tried your walkthrough and encountered an error at the map(dfTransform) step. Error in names(x) <- value : 'names' attribute [6] must be the same length as the vector [1] – forest_codes Jul 19 '17 at 18:11
  • @forest_codes check that a) all your files have the same number of columns, b) that you're specifying a name for every column in the names vector `colnames(file) <- c("X","Y","Z","I","A","FC")` (you must provide a value for every column). If that still doesn't work, try passing `col.names` as an argument in `read.table` instead. – Mako212 Jul 19 '17 at 18:17
  • cont. This has me a bit perplexed: the data is in .txt format and "\t" separated, and at that point should be separated into 6 columns. I don't think it has anything to do with your supplied code, but I don't quickly see the error in my data either (I have two 5-row files for my small test set) – forest_codes Jul 19 '17 at 18:20
  • @forest_codes You can try `read.delim`, which defaults to `\t` as the separator and has default arguments for reading tab-delimited files – Mako212 Jul 19 '17 at 18:26
  • I've updated the data files to correct them and now receive a: "In if (file == "") file <- stdout() else if (is.character(file)) { : the condition has length > 1 and only the first element will be used" error – forest_codes Jul 19 '17 at 19:13
  • Specifically at the dfTransform step the message: "Error in `colnames<-`(`*tmp*`, value = c("X", "Y", "Z", "I", "A", "FC")) : attempt to set 'colnames' on an object with less than two dimensions" arises. – forest_codes Jul 19 '17 at 19:16
  • My data "correction" was to reset the data files the script was using. At some point during the morning, and then once again in the afternoon before I posted this question, I had successfully run the script, separating calculated csv's how I wanted. I'm trying to find what I did correctly then, but would also love to understand what's going on here as well if possible – forest_codes Jul 19 '17 at 19:19