I see no reason to save intermediate objects separately unless they need to be used multiple times. This is not the case in your code, so I would replace all your df[0-9]
with df
:
read_File <- function(file){
# read Excel file
df <- read.xls(file,sheet = 1, pattern = "Name", header = T,
na.strings = ("##"), stringsAsFactors = F)
# remove rows where Name is empty
df <- df[-which(df$Name == ""), ]
# remove rows where "Name" is repeated
df <- df[-which(df$Name == "Name"), ]
# remove all empty columns (anything with an auto-generated colname)
df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
row.names(df) <- NULL
df$FileName <- file
return(df)
}
df3
is not a nice descriptive variable name - it doesn't tell you anything more about the variable then df
. Sequentially naming variables steps like that also creates a maintenance burden: if you need to add a new step in the middle, you will need to rename all subsequent objects to maintain consistency - which sounds both annoying and potentially risky for bugs.
(Or have something hacky like df2.5
, which is ugly and doesn't generalize well.) Generally, I think sequentially named variables are almost always bad practice, even when they are separate objects that you need saved.
Furthermore, keeping the intermediate objects around is not good use of memory. In most cases it won't matter, but if your data is large than saving all the intermediate steps separately will greatly increase the amount of memory used during processing.
The comments are excellent, lots of detail - they tell you all you need to know about what's going on in the code.
If it were me, I would probably combine some steps, something like this:
read_File <- function(file){
# read Excel file
df <- read.xls(file,sheet = 1, pattern = "Name", header = T,
na.strings = ("##"), stringsAsFactors = F)
# remove rows where Name is bad:
bad_names <- c("", "Name")
df <- df[-which(df$Name %in% bad_names), ]
# remove all empty columns (anything with an auto-generated colname)
df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
row.names(df) <- NULL
df$FileName <- file
return(df)
}
Having a bad_names
vector to omit saves a line and is more parametric - it would be trivial to promote bad_names
to a function argument (perhaps with the default value c("", "Name")
) so that the user could customize it.