  • I have a dataset with x countries over y years.
  • I would like to run a certain analysis on it (shown below; this code is not the problem).
  • The problem: I would like to run this same analysis many times, each time on a different subset of the data containing a different combination of the x countries and y years. To be clear: I would like to run the analysis for EACH possible combination of the x countries and the y years.

The code that I would like to execute for each version of the dataset (explanation dataset see further)

library(stats)
##### the analysis for one dataset ####
# For each threshold i, keep the rows at or below it and sum the seasonal outcomes
o <- lapply(1:999, function(i) {

  # NOTE: the original filtered on 'rainfed'; the sample data below only has
  # 'irrigation', so that column is assumed here
  Alldata_Rainfed <- subset(Alldata, irrigation <= i)

  # sum over the subset (the original summed over the full Alldata)
  c(outcome_spring = sum(Alldata_Rainfed$spring),
    outcome_summer = sum(Alldata_Rainfed$summer),
    outcome_autumn = sum(Alldata_Rainfed$autumn),
    outcome_winter = sum(Alldata_Rainfed$winter))
})

# the output I want is another dataset for each unique dataset
combination <- as.data.frame(do.call(rbind, o))

#### the end of the analysis for one dataset ####

Desired output

That means that as output I need as many datasets (named "combination" in the example) as there are possible combinations of the x countries and y years.

As an example, imagine having the following dataset (real dataset has over 500000 observations, 15 countries, 9 years)

> dput(Alldata)
structure(list(country = c("belgium", "belgium", "belgium", "belgium", 
"germany", "germany", "germany", "germany"), year = c(2004, 2005, 
2005, 2013, 2005, 2009, 2013, 2013), spring = c(23, 24, 45, 23, 
1, 34, 5, 23), summer = c(25, 43, 654, 565, 23, 1, 23, 435), 
    autumn = c(23, 12, 4, 12, 24, 64, 23, 12), winter = c(34, 
    45, 64, 13, 346, 74, 54, 45), irrigation = c(10, 30, 40, 
    300, 288, 500, 996, 235), id = c(1, 2, 2, 3, 4, 5, 6, 6)), datalabel = "", time.stamp = "14 Nov 2016 20:09", .Names = c("country", 
"year", "spring", "summer", "autumn", "winter", "irrigation", 
"id"), formats = c("%9s", "%9.0g", "%9.0g", "%9.0g", "%9.0g", 
"%9.0g", "%9.0g", "%9.0g"), types = c(7L, 254L, 254L, 254L, 254L, 
254L, 254L, 254L), val.labels = c("", "", "", "", "", "", "", 
""), var.labels = c("", "", "", "", "", "", "", "group(country year)"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8"), version = 12L, class = "data.frame")

In the example above, I already made an id for combining country and year. That means I want to make datasets with all observations that have combinations of the following ids:

  • dataset 1_2_3_4_5: ids 1, 2, 3, 4, 5 (so this dataset only misses the observations with id = 6)
  • dataset 1_2_3_4_6: ids 1, 2, 3, 4, 6 (but not 5)
  • dataset 1_2: ids 1, 2 (but not all the rest)
  • dataset 3_4_5: ids 3, 4, 5 (but not all the rest)
  • ....

etc etc... Note that I gave the name of the dataset the name of the ids that are included. Otherwise it will be hard for me to distinguish all the different datasets from each other. Other names are fine too, as long as I can distinguish between the datasets!

Thanks for your help!

EDIT: it is possible that certain datasets give no results (because the second loop iterates over irrigation, and certain combinations might not have irrigation), but then the output should just be a dataset with missing values.

user33125

1 Answer


Not sure if this is the most efficient way of doing this, but I think it should work:

# create a df to store the results of all combinations
result=data.frame()

The loops below are based on the combn() function, which generates all possible combinations of m elements taken from a vector (here the id column).
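As a small illustration of what combn() returns (the vector `ids` here is made up for the example and is not the asker's data), each column of the resulting matrix holds one combination:

```r
# Hypothetical id vector, for illustration only
ids <- c(1, 2, 3, 4)

# All ways of choosing 2 ids out of 4; one combination per column
pairs <- combn(ids, 2)

ncol(pairs)   # number of combinations, equal to choose(4, 2)
pairs[, 1]    # the first combination
```

Looping over the columns of this matrix is what lets the code handle one combination at a time.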

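ids=unique(Alldata$id)   # Alldata is the full dataset from the question
for(i in 2:length(ids)){   # combination sizes, bounded by the number of distinct ids
  combis=combn(ids,i)
  for(j in 1:ncol(combis)){
    sub=Alldata[Alldata$id %in% combis[,j],]
    out=sub[1,]    # use your function
    out$label=paste(combis[,j],collapse='_') # label such as "1_2_3", so you know which combination this result belongs to
    result=rbind(result,out) # append it to the previous output
  }
}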
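Since the question asks for one separate dataset per combination, a variation might store each result in a named list instead of one stacked data frame. This is only a sketch under assumptions: `analyse()` is a hypothetical stand-in for the real analysis, the toy `Alldata` uses made-up values, and single-id subsets are included here, unlike the loop above, which starts at 2.

```r
# Toy data with column names borrowed from the question (values are made up)
Alldata <- data.frame(
  id         = c(1, 2, 2, 3),
  spring     = c(23, 24, 45, 23),
  irrigation = c(10, 30, 40, 300)
)

# Hypothetical stand-in for the real analysis: returns one row of summaries
analyse <- function(dat) {
  data.frame(outcome_spring = sum(dat$spring), n_obs = nrow(dat))
}

ids <- sort(unique(Alldata$id))
results <- list()

for (m in seq_along(ids)) {
  # Combine positions rather than the ids themselves; this also behaves
  # when there is only a single id
  combis <- combn(length(ids), m)
  for (j in seq_len(ncol(combis))) {
    chosen <- ids[combis[, j]]
    label  <- paste(chosen, collapse = "_")   # e.g. "1_2"
    results[[label]] <- analyse(Alldata[Alldata$id %in% chosen, ])
  }
}

names(results)      # one entry per combination
results[["1_2"]]    # the result for ids 1 and 2
```

Note that n ids yield 2^n - 1 non-empty subsets, so running every combination is only feasible for a small number of ids.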
Wave
  • thank you very much! I had some trouble understanding it at the beginning but I think I am starting to understand. I am going to run it tomorrow at the university server. Hope it works! I'll let you know! Thank you very much in any case! – user33125 Nov 14 '16 at 22:42
  • Hey @Wave! Your solution worked perfectly. Thank you very much! You helped me a lot! I asked a very small follow-up question: http://stackoverflow.com/questions/40636032/combine-observations-based-on-the-variable-id-if-at-least-5-ids-are-combined Maybe you know the answer to this one as well? If not: thank you very much! – user33125 Nov 16 '16 at 15:34
  • Although @Wave, there is one thing that I cannot explain. I get the following error "Error in combn(unique(Alldata$id), i) : n < m. " But even though I get this error, I do get an output. Is that a problem? Through this link: https://drive.google.com/open?id=0By9u5m3kxn9ybi11OEF5NkhkNDQ an example of the data + the R script can be found. – user33125 Nov 16 '16 at 17:29
  • I can't be sure without seeing the data (the Google Drive file is not accessible). But what it means is that unique(id) has fewer elements than m (i in the loop), and you obviously can't choose, for example, 3 elements from a vector that only contains 2 values. It is up to you to check what values m and n have. Also, it is easier if you provide ready-to-use data, for instance with dput(mydata). – Wave Nov 16 '16 at 19:44
  • The documents on Google Drive should be available now. I am aware of dput, but since a larger sample is needed for it to be useful, I thought a separate file was more practical. I believe the problem was that I named my ids 10, 20, 30... instead of 1, 2, 3. I added the zero to make the labels more readable. For instance: if R combines 1, 10, 11 and 2, then the label is 110112. But this is not readable for me. I don't know whether that means 110+11+2 or something else. In total the real dataset has more than 190 ids. By multiplying the ids by 10, I know that an id stops after each final 0. – user33125 Nov 16 '16 at 22:13
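
The two points raised in these comments can be reproduced in a few lines. This is a minimal sketch with hypothetical ids spaced by 10, as described above; it assumes the loop bound was taken from the largest id value rather than from the number of distinct ids.

```r
ids <- c(10, 20, 30)   # hypothetical ids, spaced by 10 as described above

# Only 3 distinct ids exist, so asking for 4 at a time fails;
# a bound taken from max(ids) would eventually reach such a size
msg <- tryCatch(combn(ids, 4), error = function(e) conditionMessage(e))
msg

# Bounding the sizes by the number of distinct ids avoids the error
sizes <- 2:length(ids)

# A separator keeps labels readable without rescaling the ids
label <- paste(c(1, 10, 11, 2), collapse = "_")
label
```

With a separator in the label, the ids can keep their original values.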