
I have been trying to compute the correlation between variables in panel data. The base cor() function does not account for fixed effects. My data has the following form (the real data has more entities, years and variables, and some values of x are 0):

entity year var1 var2 var3
1 2000 x x x
1 2001 x x x
1 2002 x x x
2 2000 x x x
2 2001 x x x
2 2002 x x x
3 2000 x x x
3 2001 x x x
3 2002 x x x

I have tried the plm package and the cortab function, but it appears to compute the correlation between groups of entities for the same variable. Other solutions I have found online don't seem to calculate the correlation correctly.

The output should look like:

var1 var2 var3
var1 x x x
var2 x x x
var3 x x x

The data I am using is balanced. I plan to use the script on a variety of datasets; a separate script removes non-numeric values and ensures the data is in this format.

[Screenshot: Example of simple method]

Most correlation methods simply compute the correlation between two variable columns. For my purpose, however, this can lead to miscalculations. In the screenshot, Var1 and Var2 have a correlation of 1 when looking at a single entity, but the normal correlation method returns a different result. One of the datasets I am using has 240k data points, so this issue will lead to drastically miscalculated results across a large sample. I could calculate each correlation within an entity and average them, but I do not think this is best practice, and I would like to find a correct method for panel data.
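To make the problem concrete, here is a minimal sketch in R with made-up numbers (not my real data; the values and the names `pooled` and `by_entity` are purely illustrative). Within each entity var2 is var1 plus a constant, so the per-entity correlation is 1, yet the pooled correlation is much lower:

```r
# Hypothetical toy panel: 3 entities x 3 years, values invented for illustration.
# Inside each entity, var2 = var1 + an entity-specific constant.
df <- data.frame(
  entity = rep(1:3, each = 3),
  year   = rep(2000:2002, times = 3),
  var1   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  var2   = c(11, 12, 13, 6, 7, 8, 1, 2, 3)
)

# Pooled correlation: treats all 9 rows as one sample and ignores entities
pooled <- cor(df$var1, df$var2)   # roughly 0.2 for these numbers

# Correlation computed separately inside each entity
by_entity <- sapply(split(df, df$entity), function(d) cor(d$var1, d$var2))
# every entity gives 1 (up to floating point)

print(pooled)
print(by_entity)
```

The gap between the two results comes entirely from the different level of var2 in each entity, which is the "entity effect" I want the method to remove.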

howkesh
  • IMHO, this question lacks clarity. Stack Overflow is a website designed for asking programming questions. Here, we cannot answer a programming question because the task is not specified in sufficient detail. You should tell us **exactly** what mathematical procedure you want to apply to the data, and **exactly** what the returned values should be. For example, "account for fixed effects" is extremely vague. Ideally, you would also supply a minimal working example. Please read this link: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 – Vincent May 11 '22 at 12:03

1 Answer


You could use modelsummary::datasummary_correlation() for that, e.g.

library(dplyr)  # provides %>% and select()
df %>% select(var1:var3) %>% modelsummary::datasummary_correlation()
Julian
  • This solution gave me the same values as simply running cor() on the dataset. The cor() function calculates the correlation of df$var1 against df$var2, etc. This means the correlation is miscalculated when the entity changes (it does not capture the entity effect). The method I need would act as if it takes the correlation between the variables for each entity and then combines them into a final value. – howkesh May 11 '22 at 09:55
  • Do you want a correlation matrix for each entity for one variable? I do not completely understand A,B,C in your output dataframe. – Julian May 11 '22 at 10:14
  • The output of the function you suggested is in the format I am looking for. But the function appears to calculate the correlation by 'running down' the vectors of two variables. Because the values belong to different entities, this gives an incorrect correlation: the variables' values "jump" when moving from one entity to another. I have attached an example to the bottom of my original question. – howkesh May 11 '22 at 10:25
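
One possible reading of the behaviour described in these comments is a "within" transformation: subtract each entity's own mean from every variable, then correlate the demeaned columns, so that level "jumps" between entities no longer affect the result. The following is only a sketch under that assumption, using base R and a made-up toy panel; the data values and the names `vars`, `demeaned`, `pooled_cor` and `within_cor` are hypothetical and do not come from the thread.

```r
# Hypothetical balanced toy panel (3 entities x 3 years); values are invented.
df <- data.frame(
  entity = rep(1:3, each = 3),
  year   = rep(2000:2002, times = 3),
  var1   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  var2   = c(11, 12, 13, 6, 7, 8, 1, 2, 3),
  var3   = c(5, 3, 4, 9, 7, 8, 2, 0, 1)
)
vars <- c("var1", "var2", "var3")

# Within transformation (assumed approach): remove each entity's mean.
# ave() returns the per-entity mean aligned with every row.
demeaned <- as.data.frame(
  lapply(df[vars], function(v) v - ave(v, df$entity))
)

pooled_cor <- cor(df[vars])   # same as plain cor(): ignores entities
within_cor <- cor(demeaned)   # correlation after removing entity means

print(round(pooled_cor, 3))
print(round(within_cor, 3))
```

The result is a var1/var2/var3 matrix in the format requested in the question. Note that this sketch removes entity effects only; year effects are left in, and whether the demeaned correlation is the right definition for a given dataset is a judgment call rather than something the thread settles.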