I have been trying to compute the correlations between variables in panel data. The base cor() function does not account for fixed effects. My data has the following form (the real data has more entities, years and variables, and some values of x are 0):
entity | year | var1 | var2 | var3 |
---|---|---|---|---|
1 | 2000 | x | x | x |
1 | 2001 | x | x | x |
1 | 2002 | x | x | x |
2 | 2000 | x | x | x |
2 | 2001 | x | x | x |
2 | 2002 | x | x | x |
3 | 2000 | x | x | x |
3 | 2001 | x | x | x |
3 | 2002 | x | x | x |
I have tried using the plm package and its cortab function, but it appears to compute the correlation between groups of entities for the same variable rather than between variables. Other solutions I have found online don't seem to calculate the correlation correctly.
The output should look like:
| | var1 | var2 | var3 |
---|---|---|---|
var1 | x | x | x |
var2 | x | x | x |
var3 | x | x | x |
The data I am using is balanced. The plan is to use the script on a variety of datasets; a separate script removes non-numeric values and ensures the data is in this format.
Most correlation methods simply compute the correlation between two variable columns, pooling all rows. However, this can lead to miscalculations for my purpose. In the screenshot, Var1 and Var2 have a correlation of 1 when looking at a single entity, yet the normal correlation method returns a different result. One of the datasets I am using has 240k data points, so this issue will lead to drastically miscalculated results across a large sample. While I could calculate each correlation within an entity and average them, I do not think this is best practice, and I would like to find a correct method for panel data.
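To make the problem concrete, here is a minimal sketch in base R. It assumes that "accounting for fixed effects" means removing each entity's mean from every variable before correlating (the within transformation). The helper name `within_cor` and the toy data are made up for illustration, and the column names follow the table above.

```r
# Hypothetical helper: correlation matrix after subtracting entity means
# from each variable (the "within" transformation).
within_cor <- function(df, entity = "entity",
                       vars = c("var1", "var2", "var3")) {
  # ave() returns the group mean aligned with each row
  demeaned <- sapply(vars, function(v) df[[v]] - ave(df[[v]], df[[entity]]))
  cor(demeaned)
}

# Toy balanced panel (illustrative values only): within each entity var2 is
# var1 plus an entity-specific constant, i.e. an entity fixed effect.
panel <- data.frame(
  entity = rep(1:3, each = 3),
  year   = rep(2000:2002, times = 3),
  var1   = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)
panel$var2 <- panel$var1 + c(0, 10, -10)[panel$entity]
panel$var3 <- c(5, 1, 4, 2, 8, 3, 7, 6, 9)

# Pooled correlation ignores the entity structure
pooled <- cor(panel[, c("var1", "var2", "var3")])

# Within correlation removes the entity-level shifts first
within <- within_cor(panel)

print(round(pooled, 3))
print(round(within, 3))
```

In this toy data the pooled correlation between var1 and var2 is far below 1, while the within correlation is 1, which matches the behaviour described above. If year effects should be removed as well, a balanced panel allows the same idea to be extended by also subtracting year means and adding back the overall mean. Note that a variable that is constant within every entity has zero variance after demeaning, so cor() would return NA for it.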