I am trying to make my code DRYer by moving repeated parts into functions. One of these functions is:

datasetperuniversity <- function(university, year) {
  assign(
    paste("data", university, sep = ""),
    subset(
      get(paste("originaldata", year, sep = "")),
      get(paste("allcollaboration", university, sep = "")) == 1
    )
  )
}

Calling datasetperuniversity("Harvard","2000") should, inside the function, amount to something like this:

dataHarvard=subset(originaldata2000,allcollaborationHarvard==1)

The function runs almost perfectly, except that it does not store the result in dataHarvard. I read that this is normal inside functions, and that using <<- instead of = could solve it. However, since I am using assign, that is not really an option: the = is only what the assign call amounts to, not something I write myself.

Here is some sample data:

sales = c(2, 3, 5,6) 
numberofemployees = c(1, 9, 20,12) 
allcollaborationHarvard = c(0, 1, 0,1) 
originaldata = data.frame(sales, numberofemployees, allcollaborationHarvard)
Ger
  • I guess it will be easier if you rearrange your data. Don't carry around `originaldata2000`, `originaldata2001`, etc -- just put them together in one table with a year column. And if your `allcollaboration[uni]` cols are mutually exclusive, use one categorical column instead of dummies. For more on this line of thinking if you're interested: https://www.jstatsoft.org/article/view/v059i10 – Frank May 02 '18 at 16:22
  • @Frank Thanks for this very clear and easy-to-implement suggestion. Although I will use this method for now, I keep wondering whether my original question is answerable for cases where merging all the datasets is not preferable – Ger May 03 '18 at 07:11

1 Answer

Generally, it's best not to embed data/a variable into the name of an object. So instead of using assign to dataHarvard, make a list data with an element called "Harvard":

# enumerate unis, attaching names for lapply to use
unis = setNames(, "Harvard")

# make a table for each subset with lapply
data = lapply(unis, function(x) 
  originaldata[originaldata[[ paste0("allcollaboration", x) ]] == 1, ]
)

which gives

> data
$Harvard
  sales numberofemployees allcollaborationHarvard
2     3                 9                       1
4     6                12                       1

As seen here, you can use DF[["column name"]] to access a column instead of get as in the OP. Also, see the note in ?subset:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
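
To tie this back to the function in the question, here is a minimal sketch that returns the subset instead of assigning it, with `[` and `[[` in place of subset and get. The helper name subset_by_university and its arguments are made up for illustration; the data is the sample from the question.

```r
# Sample data from the question
originaldata <- data.frame(
  sales = c(2, 3, 5, 6),
  numberofemployees = c(1, 9, 20, 12),
  allcollaborationHarvard = c(0, 1, 0, 1)
)

# Hypothetical helper (not from the original post): return the rows of `df`
# whose dummy column for `university` equals 1
subset_by_university <- function(df, university) {
  flag <- df[[paste0("allcollaboration", university)]]
  if (is.null(flag)) stop("no collaboration column for ", university)
  # which() drops NA comparisons instead of producing all-NA rows
  df[which(flag == 1), , drop = FALSE]
}

# The caller decides where the result is stored
dataHarvard <- subset_by_university(originaldata, "Harvard")
```

Because the function returns its result, the assignment happens at the call site, so neither assign nor <<- is needed.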

Generally, it's also better not to embed data in column names if possible. If the allcollaboration* columns are mutually exclusive, they can be collapsed to a single categorical variable with values like "Harvard", "Yale", etc. Alternatively, it might make sense to put the data in long form.
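
As a rough sketch of that collapsing step: the allcollaborationYale column below is invented for illustration (the question only has the Harvard dummy), and the code assumes exactly one dummy equals 1 in every row.

```r
# Sample data from the question plus a hypothetical Yale dummy (an assumption)
originaldata <- data.frame(
  sales = c(2, 3, 5, 6),
  numberofemployees = c(1, 9, 20, 12),
  allcollaborationHarvard = c(0, 1, 0, 1),
  allcollaborationYale = c(1, 0, 1, 0)
)

# Names of all dummy columns
dummy_cols <- grep("^allcollaboration", names(originaldata), value = TRUE)

# For each row, the index of the dummy that is 1
# (only meaningful if the dummies are mutually exclusive)
which_uni <- max.col(as.matrix(originaldata[dummy_cols]), ties.method = "first")

# Replace the dummies with one categorical column
originaldata$university <- sub("^allcollaboration", "", dummy_cols)[which_uni]
tidy <- originaldata[setdiff(names(originaldata), dummy_cols)]

# One table per university, as a named list
by_uni <- split(tidy, tidy$university)
```

With this layout, by_uni$Harvard (or tidy[tidy$university == "Harvard", ]) plays the role of dataHarvard, and no column name has to be pasted together.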

For more guidance on arranging data, I recommend Hadley Wickham's tidy data paper.
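
In the same spirit, per-year tables such as originaldata2000 and originaldata2001 can be stacked into one table with a year column. This is only a sketch: the values for 2001 are made up, and the numberofemployees column is left out to keep it short.

```r
# Per-year tables; the 2001 values are invented for illustration
originaldata2000 <- data.frame(sales = c(2, 3, 5, 6),
                               allcollaborationHarvard = c(0, 1, 0, 1))
originaldata2001 <- data.frame(sales = c(4, 7),
                               allcollaborationHarvard = c(1, 0))

# Keep the tables in a named list rather than in separately named objects
yearly <- list("2000" = originaldata2000, "2001" = originaldata2001)

# Add a year column to each table, then stack them
alldata <- do.call(
  rbind,
  Map(function(df, yr) cbind(year = as.integer(yr), df), yearly, names(yearly))
)
rownames(alldata) <- NULL

# The year is now data, so no get(paste("originaldata", year)) is needed
harvard2000 <- alldata[alldata$year == 2000 & alldata$allcollaborationHarvard == 1, ]
```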

Frank