I'm running about 45,000 local linear regressions (essentially) on about 1.2 million observations, so I'd appreciate some help trying to speed things up because I'm impatient.
I'm basically trying to construct year-by-position wage contracts--the function wage(experience given firm,year,position)--for a bunch of firms.
Here's the data (basic structure of) set I'm working with:
> wages
firm year position exp salary
1: 0007 1996 4 1 20029
2: 0007 1996 4 1 23502
3: 0007 1996 4 1 22105
4: 0007 1996 4 2 23124
5: 0007 1996 4 2 22700
---
1175141: 994 2012 5 2 47098
1175142: 994 2012 5 2 45488
1175143: 994 2012 5 2 47098
1175144: 994 2012 5 3 45488
1175145: 994 2012 5 3 47098
I want to construct the wage function for experience levels 0 through 40 for all firms, a la:
> salary_scales
firm year position exp salary
1: 0007 1996 4 0 NA
2: 0007 1996 4 1 21878.67
3: 0007 1996 4 2 23401.33
4: 0007 1996 4 3 23705.00
5: 0007 1996 4 4 24260.00
---
611019: 9911 2015 4 36 NA
611020: 9911 2015 4 37 NA
611021: 9911 2015 4 38 NA
611022: 9911 2015 4 39 NA
611023: 9911 2015 4 40 NA
To that end, I've been working (at the suggestion of @BondedDust here) with the COBS (COnstrained B-Spline) package, which allows me to build in the monotonicity of the wage contract.
Some problems remain; in particular, when I need to extrapolate (whenever a given firm doesn't have any very young or very old employees), there's a tendency for the fit to lose monotonicity or to drop below 0.
To get around this, I've been using simple linear extrapolation outside the data bounds--extend the fit curve outside min_exp
and max_exp
so that it passes through the two lowest (or highest) fit points--not perfect, but it seems to be doing pretty well.
With that in mind, here's how I'm doing this so far (keep in mind I'm a data.table
fanatic):
#get the range of experience for each firm
wages[,min_exp:=min(exp),by=.(year,firm,position)]
wages[,max_exp:=max(exp),by=.(year,firm,position)]
#Can't interpolate if there are only 2 or 3 unique experience cells represented
wages[,node_count:=length(unique(exp)),by=.(year,firm,position)]
#Nor if there are too few teachers
wages[,ind_count:=.N,by=.(year,firm,position)]
#Also troublesome when there is little variation in salaries like so:
wages[,sal_scale_flag:=mean(abs(salary-mean(salary)))<50,by=.(year,firm,position)]
wages[,sal_count_flag:=length(unique(salary))<5,by=.(year,firm,position)]
cobs_extrap<-function(exp,salary,min_exp,max_exp,
constraint="increase",print.mesg=F,nknots=8,
keep.data=F,maxiter=150){
#these are passed as vectors
min_exp<-min_exp[1]
max_exp<-min(max_exp[1],40)
#get in-sample fit
in_sample<-predict(cobs(x=exp,y=salary,
constraint=constraint,
print.mesg=print.mesg,nknots=nknots,
keep.data=keep.data,maxiter=maxiter),
z=min_exp:max_exp)[,"fit"]
#append by linear extension below min_exp
c(if (min_exp==1) NULL else in_sample[1]-
(min_exp:1)*(in_sample[2]-in_sample[1]),in_sample,
#append by linear extension above max_exp
if (max_exp==40) NULL else in_sample[length(in_sample)]+(1:(40-max_exp))*
(in_sample[length(in_sample)]-in_sample[length(in_sample)-1]))
}
salary_scales<-
wages[node_count>=7&ind_count>=10
&sal_scale_flag==0&sal_count_flag==0,
.(exp=0:40,
salary=cobs_extrap(exp,salary,min_exp,max_exp)),
by=.(year,firm,position)]
Notice anything in particular that could be slowing down my code? Or am I forced to be patient?
To play around with here are some of the smaller firm-position combos:
firm year position exp salary count
1: 0063 2010 5 2 37433 10
2: 0063 2010 5 2 38749 10
3: 0063 2010 5 4 38749 10
4: 0063 2010 5 8 42700 10
5: 0063 2010 5 11 47967 10
6: 0063 2010 5 15 50637 10
7: 0063 2010 5 19 51529 10
8: 0063 2010 5 23 50637 10
9: 0063 2010 5 33 52426 10
10: 0063 2010 5 37 52426 10
11: 9908 2006 4 1 26750 10
12: 9908 2006 4 6 36043 10
13: 9908 2006 4 7 20513 10
14: 9908 2006 4 8 45023 10
15: 9908 2006 4 13 33588 10
16: 9908 2006 4 15 46011 10
17: 9908 2006 4 15 37179 10
18: 9908 2006 4 22 43704 10
19: 9908 2006 4 28 56078 10
20: 9908 2006 4 29 44866 10