3

I have a dataframe that looks like this:

weekyear      Location_Id              priceA                   priceB
1    20101        6367                0.8712934                    8
2    20101        6380                0.1712934                    8
3    20102        6367                0.8712934                    4
4    20102        6380                0.4712934                    4
5    20103        6367                0.8712934                    1
6    20103        6380                0.8712934                    9

I would like to demean priceA and priceB. Each is indexed by location and time. I would like for

priceAnew = priceA_{location,time} - mean(over time)(priceA_{location}) - mean(over location)(priceA_{time})

The notation is clearer here: https://stats.stackexchange.com/questions/126549/do-people-used-fixed-effects-in-lasso

Is there a nonpainful way to do this?

Community
  • 1
  • 1
wolfsatthedoor
  • 7,163
  • 18
  • 46
  • 90

1 Answers1

5

I'm going to guess you're looking for something like

transform(dd, 
    newA = priceA-ave(priceA, weekyear)-ave(priceA, Location_Id),
    newB = priceB-ave(priceB, weekyear)-ave(priceB, Location_Id)
)

(where dd is the name of your data.frame). This returns

  weekyear Location_Id    priceA priceB       newA      newB
1    20101        6367 0.8712934      8 -0.5212934 -4.333333
2    20101        6380 0.1712934      8 -0.8546267 -7.000000
3    20102        6367 0.8712934      4 -0.6712934 -4.333333
4    20102        6380 0.4712934      4 -0.7046267 -7.000000
5    20103        6367 0.8712934      1 -0.8712934 -8.333333
6    20103        6380 0.8712934      9 -0.5046267 -3.000000

for your sample input. If you had to do this on many, many columns, i might prefer a loop.

cols <- paste0("price", LETTERS[1:2])
for(col in cols) {
    dd[[paste0("new", col)]] <- dd[[col]] - 
        ave(dd[[col]], dd$weekyear)-
        ave(dd[[col]], dd$Location_Id),
}
MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • 1
    If there are NAs thrown around, ave seems to fail. I don't see anything in the docs about ignoring NAs – wolfsatthedoor Dec 11 '14 at 21:40
  • 1
    I tried x = newA = priceA-ave(priceA, weekyear,FUN=mean,na.rm=TRUE)-ave(priceA, Location_IdFUN=mean,na.rm=TRUE) but it didnt work – wolfsatthedoor Dec 11 '14 at 21:47
  • 1
    I would just create a new helper function: `naave<-function(...) ave(..., FUN=function(x) mean(x, na.rm=T))`. Then just use `naave` rather than `ave` (assuming you want to ignore those values). – MrFlick Dec 11 '14 at 21:53
  • Ugh I'm so sorry to come back, but how might I do this for all my columns (or 98/100 of them) in my data set without typing manually? I tried the natural replacements and it seems transform doesn't like it when I use ddnewcol = dd[column] + etc. (I was planning to loop over all the columns) – wolfsatthedoor Dec 11 '14 at 22:35
  • 1
    I've added an update that will loop over a vector of column names – MrFlick Dec 11 '14 at 22:40
  • This answer actually has a mild bug: one will end up centered around a large negative number when demeaning by multiple groups (n), roughly -mean*N. One must transform by the first group, wait for the process to conclude, then transform by the second. – RegressForward Nov 06 '20 at 19:57