Regression of Results by Subgroup used to Predict using New Data using R

Question

I have a large data file (LMTESTData) that contains internal data and the results of an external assessment. Rather than subsetting manually, I have tried several variants of by and ddply to run a linear regression for each subgroup, without success.

colnames(LMTESTData)
[1] "StudentNumber"  "SubjectCode"    "SubjectName"    "ExamMark"       "AssessmentMark" "U"              "hmkk"
[8] "TESmk"          "Year"

The regression model is lm(hmkk ~ ExamMark + AssessmentMark), fitted separately for each SubjectCode.

Once the model is working, my next challenge will be to predict hmkk given SubjectCode, ExamMark and AssessmentMark for each StudentNumber.

Dummy Data Set

LMTESTData = data.frame(StudentNumber = 1:100, SubjectCode = c("A","B","C","D","E"),
                hmkk = rnorm(100, mean = 72),
                ExamMark = rnorm(100, mean = 62), AssessmentMark = rnorm(100, mean = 68))

score 2 · Accepted Answer · answered Aug 01 '15 at 07:03


This is the classic R split-lapply pattern. (If you only needed the coefficients, or perhaps predict()-ions, sapply could deliver them as a matrix instead.) It returns a list with one fitted model per SubjectCode:

lapply( split(LMTESTData, LMTESTData$SubjectCode),
        function(d) lm(hmkk ~ ExamMark + AssessmentMark, data=d)
        )
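
As a minimal sketch of the sapply variant mentioned above, the following keeps the fitted models in a named list and then pulls the coefficients into a matrix. The names models and coef_matrix are illustrative assumptions, and set.seed is only there to make the dummy data reproducible.

```r
# Dummy data in the shape given in the question
set.seed(1)  # assumption: fixed seed, only for reproducibility
LMTESTData <- data.frame(
  StudentNumber  = 1:100,
  SubjectCode    = rep(c("A", "B", "C", "D", "E"), times = 20),
  hmkk           = rnorm(100, mean = 72),
  ExamMark       = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

# One fitted model per subject, stored in a list named by SubjectCode
models <- lapply(
  split(LMTESTData, LMTESTData$SubjectCode),
  function(d) lm(hmkk ~ ExamMark + AssessmentMark, data = d)
)

# Every model has the same three coefficients, so sapply() simplifies
# the result to a matrix: one row per coefficient, one column per subject
coef_matrix <- sapply(models, coef)
print(coef_matrix)
```

Keeping the list of models around, rather than only the coefficients, means each subject's model can later be handed to predict().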

IRTFM


Fantastic, I had been trying lapply but had a syntax error. I'm still learning: for some NEWDATA that has ExamMark, AssessmentMark and SubjectCode, what is the easiest way to produce a table with the hmkk estimate? – DataEdLinks Aug 01 '15 at 07:29
To apply a fitted model to new data, use the `predict` function with its `newdata` argument. – IRTFM Aug 01 '15 at 16:42
I was struggling with how to match against subject code when calling predict. I used data.table and this line: `NewStudents[, TES_EST := predict(lm(hmkk ~ ExamMark + AssessmentMark, data = LMTESTData[.BY]), newdata = .SD), by = SubjectCode]` – DataEdLinks Aug 02 '15 at 12:14
The code in question was using a data frame. The `:=` function is for data.tables. – IRTFM Aug 02 '15 at 14:40
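
For readers who want to stay with plain data frames, here is one possible base-R sketch of the prediction step discussed in the comments: fit one model per SubjectCode, then call predict() on the matching rows of the new data. NEWDATA here is made-up dummy data, and the column name hmkk_est is an illustrative assumption, not something from the thread.

```r
set.seed(2)  # assumption: fixed seed, only for reproducibility

# Training data in the shape given in the question
LMTESTData <- data.frame(
  StudentNumber  = 1:100,
  SubjectCode    = rep(c("A", "B", "C", "D", "E"), times = 20),
  hmkk           = rnorm(100, mean = 72),
  ExamMark       = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

# One fitted model per subject, in a list named by SubjectCode
models <- lapply(
  split(LMTESTData, LMTESTData$SubjectCode),
  function(d) lm(hmkk ~ ExamMark + AssessmentMark, data = d)
)

# Hypothetical new students with no hmkk yet
NEWDATA <- data.frame(
  StudentNumber  = 101:110,
  SubjectCode    = rep(c("A", "B", "C", "D", "E"), times = 2),
  ExamMark       = rnorm(10, mean = 62),
  AssessmentMark = rnorm(10, mean = 68)
)

# Fill the estimate subject by subject, using that subject's own model.
# Subjects with no fitted model are skipped and keep NA.
NEWDATA$hmkk_est <- NA_real_
codes <- intersect(names(models), unique(as.character(NEWDATA$SubjectCode)))
for (code in codes) {
  rows <- NEWDATA$SubjectCode == code
  NEWDATA$hmkk_est[rows] <- predict(models[[code]], newdata = NEWDATA[rows, ])
}

print(NEWDATA)
```

The loop writes predictions back by row position, so NEWDATA keeps its original order and each StudentNumber stays aligned with its estimate.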
