
So let's say I've defined the function below to read in a set of files:

library(gdata)  #  provides read.xls()

read_File <- function(file){
    #  read Excel file
    df1 <- read.xls(file,sheet=1, pattern="Name", header=T, na.strings=("##"), stringsAsFactors=F)
    #  remove rows where Name is empty
    df2 <- df1[-which(df1$Name==""),]
    #  remove rows where "Name" is repeated
    df3 <- df2[-which(df2$Name=="Name"),]
    #  remove all empty columns (anything with an auto-generated colname)
    df4 <- df3[, -grep("X\\.{0,1}\\d*$", colnames(df3))]
    row.names(df4) <- NULL
    df4$FileName <- file
    return(df4)
}

It works fine like this, but it feels like bad form to define df1...df4 to represent the intermediate steps. Is there a better way to do this without compromising readability?

Mako212

1 Answer


I see no reason to save intermediate objects separately unless they need to be used multiple times. That's not the case in your code, so I would replace all of the df[0-9] names with a single df:

read_File <- function(file){
    #  read Excel file
    df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                   na.strings = ("##"), stringsAsFactors = F)
    #  remove rows where Name is empty
    df <- df[-which(df$Name == ""), ]
    #  remove rows where "Name" is repeated
    df <- df[-which(df$Name == "Name"), ]
    #  remove all empty columns (anything with an auto-generated colname)
    df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
    row.names(df) <- NULL
    df$FileName <- file
    return(df)
}

df3 is not a nice, descriptive variable name - it doesn't tell you anything more about the variable than df does. Sequentially numbering variables like that also creates a maintenance burden: if you need to add a new step in the middle, you have to rename every subsequent object to stay consistent - which is both annoying and an easy way to introduce bugs. (Or you end up with something hacky like df2.5, which is ugly and doesn't generalize.) Generally, I think sequentially numbered variables are almost always bad practice, even when they are separate objects that you actually need to keep.
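As an aside - and this is just a sketch, not something your code needs - if you're happy to take a magrittr dependency, a pipeline chains the cleaning steps without naming any intermediates at all. The brace block is a magrittr idiom for steps where the data frame isn't the first argument:

library(gdata)     #  read.xls()
library(magrittr)  #  %>%

read_File <- function(file){
    df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                   na.strings = ("##"), stringsAsFactors = F) %>%
        #  keep rows whose Name is neither empty nor a repeated header
        subset(!(Name %in% c("", "Name"))) %>%
        #  drop auto-generated (empty) columns; inside the braces, `.` is
        #  the incoming data frame
        { .[, -grep("X\\.{0,1}\\d*$", colnames(.)), drop = FALSE] }
    row.names(df) <- NULL
    df$FileName <- file
    return(df)
}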

Furthermore, keeping the intermediate objects around is not a good use of memory. In most cases it won't matter, but if your data is large, then saving all the intermediate steps separately can greatly increase the amount of memory used during processing.
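You can get a rough sense of this with base R's object.size(). A minimal sketch with made-up data - the exact numbers will vary:

df1 <- data.frame(x = rnorm(1e6))
df2 <- df1[df1$x > 0, , drop = FALSE]

print(object.size(df1), units = "MB")  #  memory held by the raw data
print(object.size(df2), units = "MB")  #  held in addition, while both names exist

#  reassigning a single name lets the garbage collector reclaim each step:
df <- data.frame(x = rnorm(1e6))
df <- df[df$x > 0, , drop = FALSE]  #  the previous copy is now unreferenced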

The comments are excellent - lots of detail, telling you everything you need to know about what's going on in the code.

If it were me, I would probably combine some steps, something like this:

read_File <- function(file){
    #  read Excel file
    df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                   na.strings = ("##"), stringsAsFactors = F)
    #  remove rows where Name is bad:
    bad_names <- c("", "Name")
    df <- df[-which(df$Name %in% bad_names), ]

    #  remove all empty columns (anything with an auto-generated colname)
    df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
    row.names(df) <- NULL
    df$FileName <- file
    return(df)
}

Having a bad_names vector of values to omit saves a line and is more parametric - it would be trivial to promote bad_names to a function argument (perhaps with the default value c("", "Name")) so that the user could customize it.
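For example, here is a sketch of the promoted version. I've also switched the row filter from -which() to a plain logical subset: when nothing matches, which() returns integer(0), and df[-integer(0), ] silently drops every row, whereas the logical version keeps them all. The file name in the usage line is made up:

read_File <- function(file, bad_names = c("", "Name")){
    #  read Excel file
    df <- read.xls(file, sheet = 1, pattern = "Name", header = T,
                   na.strings = ("##"), stringsAsFactors = F)
    #  remove rows where Name matches any caller-supplied bad value;
    #  a logical subset keeps all rows when nothing matches
    df <- df[!(df$Name %in% bad_names), ]
    #  remove all empty columns (anything with an auto-generated colname)
    df <- df[, -grep("X\\.{0,1}\\d*$", colnames(df))]
    row.names(df) <- NULL
    df$FileName <- file
    return(df)
}

#  hypothetical usage: also drop "TOTAL" summary rows
df <- read_File("report.xls", bad_names = c("", "Name", "TOTAL"))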

Gregor Thomas
Agreed. This is very nice code and does not need the extra variables. I do tend to use `temp` or `junk` for variable names when I am developing code, but it comes back to haunt me sometimes. – Kevin Aug 10 '17 at 17:36