Treating binary variables for first difference method to solve autocorrelation issue

Question

I have an autocorrelation problem in my panel data, so I decided to use the first difference method to deal with it.

Most of my independent variables are binary. If I take first differences of them, I get -1, 0, and 1 instead of 0 or 1 as before.

Is this ok?

Besides, my data set is laid out over time as shown below. I am not sure how to apply the first difference method here, since several different incidents can happen on the same day:

     Date   ID  X   Y   Z   L   M   A   B   C   D   E
 01/01/2017 A   0   1   0   0   0   0   1   0   0   7.8
 01/01/2017 A   0   1   0   0   0   1   0   0   1   6.5
 01/01/2017 B   0   0   0   0   1   1   0   0   1   6.5
 01/03/2017 A   0   1   0   0   0   0   0   0   0   7.8
 01/04/2017 C   0   0   1   0   0   1   0   0   0   6.5
 01/04/2017 C   0   0   0   0   0   0   1   0   0   7.3

I sorted this again by Date and ID, which gives the following:

    Date    ID  X   Y   Z   L   M   A   B   C   D   E
 01/01/2017 A   0   1   0   0   0   0   1   0   0   7.8
 01/01/2017 A   0   1   0   0   0   1   0   0   1   6.5
 01/01/2017 B   0   0   0   0   1   1   0   0   1   6.5
 01/03/2017 A   0   1   0   0   0   0   0   0   0   7.8
 01/04/2017 C   0   0   1   0   0   1   0   0   0   6.5
 01/04/2017 C   0   0   0   0   0   0   1   0   0   7.3

Also, is this new sort order OK to use in my panel regression, and can I take the first difference following this row sequence?

I mean technically don't you get -1, 0, or 1? – Dason Aug 20 '17 at 18:17
Yes. Is -1 OK as a binary variable? – Eric Aug 20 '17 at 18:18

score 1 · Accepted Answer · answered Aug 22 '17 at 22:40

A regressor may be either time-invariant or time-varying. For some estimators, notably the within and first-difference estimators, only the coefficients of time-varying regressors are identified (Cameron and Trivedi, Microeconometrics: Methods and Applications). Some of your regressors seem to be time-invariant.
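
As a rough illustration of both points (a differenced binary regressor takes the values -1, 0, or 1, and a time-invariant regressor differences to zero), here is a minimal base R sketch. The toy panel and the column names `id`, `time`, `x`, and `z` are made up for the example and are not taken from the question's data.

```r
# Hypothetical toy panel: two units (id), three periods each.
panel <- data.frame(
  id   = c("A", "A", "A", "B", "B", "B"),
  time = c(1, 2, 3, 1, 2, 3),
  x    = c(0, 1, 0, 1, 1, 1),  # binary, changes over time for unit A only
  z    = c(1, 1, 1, 0, 0, 0)   # binary, constant within each unit
)

# Sort by id, then time, so each difference compares consecutive periods
# of the same unit.
panel <- panel[order(panel$id, panel$time), ]

# First difference within each id; the first period of a unit has no
# predecessor, so it becomes NA.
fd <- function(v) c(NA, diff(v))
panel$dx <- ave(panel$x, panel$id, FUN = fd)
panel$dz <- ave(panel$z, panel$id, FUN = fd)

print(panel)
```

In this sketch `dx` contains -1, 0, and 1, which is what a differenced 0/1 variable is expected to look like, while `dz` is zero wherever it is defined, so its coefficient cannot be estimated after differencing.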

You are not dealing with time series, but with panel (longitudinal) data, so of course you have repeated IDs and dates. That said, you need to deal with autocorrelation using panel data tools such as the Arellano-Bond and Blundell-Bond estimators, to mention a few. See pgmm in the R plm package, or xtabond and xtdpdsys in Stata.
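
For orientation, a `pgmm` call can look like the sketch below. It follows the Arellano-Bond example from the plm documentation and uses the `EmplUK` data shipped with plm, not the data from the question, so the formula is only a template and assumes the plm package is installed.

```r
library(plm)

# Example data shipped with plm (firm-level UK employment panel).
data("EmplUK", package = "plm")

# Left of "|": the dynamic model, with lags of the dependent variable.
# Right of "|": the lags of the dependent variable used as GMM instruments.
ab_fit <- pgmm(
  log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1) + log(capital) +
    lag(log(output), 0:1) | lag(log(emp), 2:99),
  data   = EmplUK,
  index  = c("firm", "year"),  # individual identifier first, then time
  effect = "twoways",
  model  = "twosteps"
)

summary(ab_fit)
```

The summary reports, among other things, tests for serial correlation in the differenced residuals, which is the diagnostic of interest here.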

If more than one variable identifies your panel unit, you can combine them into a single ID; see: R create ID within a group. If you are working with Stata, you could do: egen id = group(sub_id_1 sub_id_2).
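
One possible base R counterpart to the Stata `egen ... group()` line is sketched below. The data frame `df` is invented for the example; the column names `sub_id_1` and `sub_id_2` are borrowed from the Stata command above.

```r
# Hypothetical data: sub_id_1 and sub_id_2 jointly identify a panel unit.
df <- data.frame(
  sub_id_1 = c("A", "A", "B", "A", "C"),
  sub_id_2 = c(1, 2, 1, 1, 1)
)

# One integer id per distinct (sub_id_1, sub_id_2) combination.
# drop = TRUE discards combinations that never occur in the data.
df$id <- as.integer(interaction(df$sub_id_1, df$sub_id_2, drop = TRUE))

print(df)
```

Rows that share both sub-identifiers (the first and fourth rows here) receive the same `id`, which can then be used as the individual index of the panel.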

Rodrigo Remedio

Thank you so much. I am using the plm function now with index = c("year","id"). Is this still OK if I sort my data by ID, then by time, and take the first difference in every row? So the data will be sorted by ID and date. – Eric Aug 22 '17 at 22:52
Unless you are using the lm function, you should not difference the data yourself. The more suitable approach would be to specify the model argument in your plm call: `plm(..., model="fd")`. – Rodrigo Remedio Aug 22 '17 at 22:57
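
A minimal sketch of that call, assuming the plm package is installed and using its bundled `Grunfeld` data in place of the data discussed here. Note that plm expects the individual identifier first and the time variable second in `index`.

```r
library(plm)

# Example data shipped with plm: 10 firms observed over 20 years.
data("Grunfeld", package = "plm")

# model = "fd" makes plm take first differences within each firm,
# so the raw (undifferenced) data is passed in.
fd_fit <- plm(
  inv ~ value + capital,
  data  = Grunfeld,
  index = c("firm", "year"),  # individual first, then time
  model = "fd"
)

summary(fd_fit)
```

Because the differencing happens inside `plm`, there is no need to sort or difference the rows by hand beforehand.
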
I hope so, because when I run dwtest on my formula, I get an autocorrelation problem. So I first applied the first difference method to my raw data and used the result as my new raw data. I removed the first year and first ID from my raw data to match the total data count. I might apply "fd" to this again, but it won't matter, since having no autocorrelation problem is my first priority. Besides, does my raw data arrangement look OK? – Eric Aug 22 '17 at 23:03
It's hard to say, Eric. Besides, your model is becoming harder to explain. Take a look at the `pgmm` function, where you can specify higher orders of lag dependence and deal better with autocorrelation. However, you have to specify the whole model manually. If you have access to it, Stata's `xtdpdsys` specifies a model structure based on the seminal papers that originated these methods, making it much easier and more didactic for a first approach. – Rodrigo Remedio Aug 22 '17 at 23:10
Thank you so much for your help. I also provided my new raw data arrangement which has been sorted both by date and ID. May I please confirm whether I can use this row sequence especially for my first difference method? – Eric Aug 22 '17 at 23:11
Using your method, you should sort first by ID and then by the time variable. Using dplyr's notation: `my_data_frame %>% arrange(id, time)`. – Rodrigo Remedio Aug 22 '17 at 23:15
Yes, that's the sequence I follow when I do it manually. Thanks a lot for your confirmation! – Eric Aug 22 '17 at 23:18
Hopefully the last question. First I used the first difference method on my raw data, which reduced the autocorrelation problem, but not significantly. I know you said I should not difference the data myself. But what if I apply the "diff()" function to each variable to redefine my raw variables within my code, and then also add the "fd" option in my plm call? When I do that, the autocorrelation problem is completely gone. Is this a valid approach? – Eric Aug 23 '17 at 00:04
I think I get it. So if I take the first difference of the first difference, isn't it identical to having three lags for each variable? – Eric Aug 23 '17 at 01:20
As I said before, that may be a valid approach when working with time series. It's hard to say whether it is correct for panel data. Besides, you are "losing interpretation" of your parameters. – Rodrigo Remedio Aug 23 '17 at 12:03
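
On what differencing twice actually does, a small base R check may help. Differencing a first difference gives x[t] - 2*x[t-1] + x[t-2], which combines the current value with its first two lags. The series `x` below is a made-up binary sequence for a single unit.

```r
x <- c(0, 1, 1, 0, 1, 0)  # hypothetical binary series for one unit

d2_nested  <- diff(diff(x))             # difference of the first difference
d2_builtin <- diff(x, differences = 2)  # same operation in a single call

# Written out by hand: x[t] - 2 * x[t-1] + x[t-2]
n <- length(x)
d2_manual <- x[3:n] - 2 * x[2:(n - 1)] + x[1:(n - 2)]

print(d2_nested)
print(all.equal(d2_nested, d2_manual))
```

For a 0/1 variable the twice-differenced values can range from -2 to 2, which is part of why the coefficients become harder to interpret.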
