Demeaning a dataset, I found two different ways that lead to different results

Question

I have the following dataset, and I found two ways to demean it.

library(plm)
library(dplyr)
data("EmplUK", package="plm")
EmplUK <- EmplUK %>%
  group_by(firm, year) %>%
  mutate(Vote = sample(c(0, 1), 1),
         Vote_won = ifelse(Vote == 1, sample(c(0, 1), 1), 0))

# EDIT: 

EmplUK <- pdata.frame(EmplUK , index=c("firm", "year"), drop.index = FALSE)

# A tibble: 1,031 x 9
# Groups:   firm, year [1,031]
    firm  year sector   emp  wage capital output  Vote Vote_won
   <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>    <dbl>
 1     1  1977      7  5.04  13.2   0.589   95.7     1        0
 2     1  1978      7  5.60  12.3   0.632   97.4     0        0
 3     1  1979      7  5.01  12.8   0.677   99.6     1        1
 4     1  1980      7  4.72  13.8   0.617  101.      1        1
 5     1  1981      7  4.09  14.3   0.508   99.6     0        0
 6     1  1982      7  3.17  14.9   0.423   98.6     0        0
 7     1  1983      7  2.94  13.8   0.392  100.      0        0
 8     2  1977      7 71.3   14.8  16.9     95.7     1        0
 9     2  1978      7 70.6   14.1  17.2     97.4     1        1
10     2  1979      7 70.9   15.0  17.5     99.6     1        1

The first one is from the answer by DaveArmstrong to "Visualise the relation between two variables in panel data":

demeaned_data <- EmplUK %>% 
  group_by(firm) %>% 
  mutate(across(c(output, wage), function(x)x-mean(x)))

The second one is from "Demean R data frame":

library(plyr)
demean <- colwise(function(x) if(is.numeric(x)) x - mean(x) else x)
demeaned_data.2 <- ddply(EmplUK, .(firm), demean)

Looking at the histograms, however, the results are very different. Does one show the difference from the mean and the other the mean minus that difference, or something like that? Are the two supposed to be the same?

hist(demeaned_data$wage, 100)

hist(demeaned_data.2$wage, 100)

The problem is that you are using both `dplyr` and `plyr`, which both define a `mutate()` function but work differently. It's generally not a good idea to load both at the same time, but if you must, you generally want to load `dplyr` *after* loading `plyr`. If you run this in a fresh R session with just the `dplyr` code, you will get the "correct" answer. The demeaned values should be centered around 0. — MrFlick, Jan 03 '21 at 08:16
As you are using package `plm`, why don't you simply use its demeaning (per individual) function? It is called `Within`. — Helix123, Jan 03 '21 at 13:44
@Helix123 Because of this question: https://stackoverflow.com/questions/65537179/visualise-the-relation-between-two-variables-in-panel-data/65539044#65539044 — Tom, Jan 03 '21 at 14:06
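
To illustrate the masking point raised in the comments, here is a minimal sketch on a made-up toy panel. The data frame `toy` and the column `wage_dm` are hypothetical and are not part of `EmplUK`. It qualifies the `dplyr` verbs with their namespace, so the result should not depend on whether `plyr` was attached later, and it cross-checks the result against base R's `ave()`. This is one possible way to verify the demeaning, not necessarily what the commenters had in mind.

```r
library(dplyr)

# Hypothetical toy panel: two firms observed over three years each
toy <- data.frame(
  firm = c(1, 1, 1, 2, 2, 2),
  year = rep(1977:1979, times = 2),
  wage = c(13.2, 12.3, 12.8, 14.8, 14.1, 15.0)
)

# Namespace-qualified calls always resolve to dplyr's versions,
# even if plyr has been attached afterwards and masks mutate()
demeaned_toy <- toy %>%
  dplyr::group_by(firm) %>%
  dplyr::mutate(wage_dm = wage - mean(wage)) %>%
  dplyr::ungroup()

# Base R cross-check: subtract the firm mean within each firm
wage_dm_base <- ave(toy$wage, toy$firm, FUN = function(x) x - mean(x))

# Per-firm means of the demeaned variable; these should be (numerically) zero
firm_means <- tapply(demeaned_toy$wage_dm, demeaned_toy$firm, mean)
print(firm_means)
print(all.equal(demeaned_toy$wage_dm, wage_dm_base))
```

If the per-firm means are not close to zero, or the two vectors disagree, that suggests a different `mutate()` was picked up than the one intended. Running `conflicts(detail = TRUE)` in the session lists which attached packages define the same function names.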
