
Given a set of variables x_1, ..., x_n, I want to find the values of the coefficients in this equation:

y = a_1*x_1 +... +a_n*x_n + c

where a_1, a_2, ..., a_n and c are all unknown. In data frame terms, each x_i is a column and I want to compute this value of y for every row in the data.

My question is: given that y, a_1, ..., a_n and c are all unknown, is there a way to find a set of solutions a_1, ..., a_n under the condition that corr(y,x_1), corr(y,x_2), ..., corr(y,x_n) are all greater than 0.7? For simplicity, take correlation here to mean Pearson correlation. I know there will not be a unique solution. But how can I construct a set of solutions a_1, ..., a_n that fulfills this condition?

I spent a day searching for ideas but could not find anything useful. A solution in any programming language is welcome, or at least a reference for this kind of problem.

Spektre
skw1990
  • At first glance, it looks like a Lagrange multiplier-based optimisation approach, but there are facets of the question that are not clear to me. Could you clarify it, especially how the model and data are related? You might have more luck with this on the mathematics site, by the way, but come back here with any issues encountered on implementation. – Bathsheba Feb 05 '16 at 08:12
  • Personally I find it weird to do what you want to do. To get more help about statistics/machine learning related questions, the people on http://stats.stackexchange.com/ are happy to help you. – Ruben Feb 05 '16 at 08:29
  • @Ruben "I find it weird what you want to do" - I suspect these exact words have been used to try and shut down many sensible ideas in the past! – Chris Taylor Feb 05 '16 at 08:48

2 Answers


No, it is not possible in general. It may be possible in some special cases.

Given x₁, x₂, ... you want to find y = a₁x₁ + a₂x₂ + ... + c so that all the correlations between y and the x's are greater than some target R. Since the correlation is

Corr(y, xi) = Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ]

your constraint is

Cov(y, xi) / Sqrt[ Var(y) * Var(xi) ] > R

which, for a positive target R (so that Cov(y, xi) must itself be positive), can be squared and rearranged to

Cov(y, xi)² > R² * Var(y) * Var(xi)

and this needs to be true for all i.

Consider the simple case where there are only two columns x₁ and x₂, and further assume that they both have mean zero (so you can ignore the constant c) and variance 1, and that they are uncorrelated. In that case y = a₁x₁ + a₂x₂ and the covariances and variances are

Cov(y, x₁) = a₁
Cov(y, x₂) = a₂
Var(x₁)    = 1
Var(x₂)    = 1
Var(y)     = (a₁)² + (a₂)²

so you need to simultaneously satisfy

(a₁)² > R² * ((a₁)² + (a₂)²)
(a₂)² > R² * ((a₁)² + (a₂)²)

Adding these inequalities together, you get

(a₁)² + (a₂)² > 2 * R² * ((a₁)² + (a₂)²)

which means that in order to satisfy both of the inequalities, you must have R < Sqrt(1/2) (by cancelling the common factor (a₁)² + (a₂)² on both sides of the inequality). So the very best you could do in this simple case is to choose a₁ = a₂ (the exact value doesn't matter as long as they are equal and positive), and both of the correlations Corr(y,x₁) and Corr(y,x₂) will be equal to 0.707. You cannot achieve correlations higher than this between y and all of the x's simultaneously in this case.
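As a quick numerical check of the two-column case, here is a minimal sketch. The four-row columns `x1` and `x2` and the helper `correlations` are made up for illustration; the columns are chosen so that they have mean zero, variance 1 and exactly zero correlation.

```python
import numpy as np

# Two exactly uncorrelated columns with mean 0 and variance 1 (illustrative data).
x1 = np.array([1.0, -1.0, 1.0, -1.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0])

def correlations(a1, a2, c=0.0):
    """Return (Corr(y, x1), Corr(y, x2)) for y = a1*x1 + a2*x2 + c."""
    y = a1 * x1 + a2 * x2 + c
    return np.corrcoef(y, x1)[0, 1], np.corrcoef(y, x2)[0, 1]

equal = correlations(1.0, 1.0)    # a1 = a2: both correlations equal 1/sqrt(2)
unequal = correlations(2.0, 1.0)  # a1 != a2: one goes up, the other goes down
print(equal)
print(unequal)
```

With equal weights both correlations come out at about 0.7071, which is only just above the 0.7 asked for in the question. Any unequal choice pushes one of the two correlations below that value.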

For the more general case with n columns (each of which has mean zero, variance 1 and zero correlation between columns) you cannot simultaneously achieve correlations greater than 1 / sqrt(n) (as pointed out in the comments by @kazemakase).
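The n-column bound follows from the same summing argument as the two-column case. A sketch of the derivation, under the same assumptions (mean zero, variance 1, pairwise uncorrelated columns):

```latex
\begin{align*}
\operatorname{Cov}(y, x_i) &= a_i, \qquad \operatorname{Var}(y) = \sum_{j=1}^{n} a_j^2, \\
a_i^2 &> R^2 \sum_{j=1}^{n} a_j^2 \quad \text{for } i = 1, \dots, n, \\
\sum_{i=1}^{n} a_i^2 &> n R^2 \sum_{j=1}^{n} a_j^2
  \;\Longrightarrow\; R < \frac{1}{\sqrt{n}}, \\
a_1 = \dots = a_n = a > 0 &\;\Longrightarrow\;
  \operatorname{Corr}(y, x_i) = \frac{a}{\sqrt{n a^2}} = \frac{1}{\sqrt{n}}.
\end{align*}
```

For example, with n = 5 uncorrelated columns the best common correlation is about 0.447, well short of 0.7.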

In general, the more independent variables there are, the lower the correlation you will be able to achieve between y and the x's. Also (although I haven't mentioned it above) the correlations between the x's matter. If they are in general positively correlated, you will be able to achieve a higher target correlation between y and the x's. If they are in general uncorrelated or negatively correlated, you will only be able to achieve low correlations between y and the x's.
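To illustrate how the correlations between the x's change what is achievable, here is a small sketch that works directly from a covariance matrix rather than from data. The function names `corr_with_columns` and `equicorrelated`, and the assumption that every pair of columns has the same correlation `rho`, are mine and only serve as an example.

```python
import numpy as np

def corr_with_columns(a, cov):
    """Corr(y, x_i) for y = a . x + c, given the covariance matrix of the x's."""
    a = np.asarray(a, dtype=float)
    cov = np.asarray(cov, dtype=float)
    cov_y_x = cov @ a      # Cov(y, x_i) for each i
    var_y = a @ cov @ a    # Var(y)
    return cov_y_x / np.sqrt(var_y * np.diag(cov))

def equicorrelated(n, rho):
    """Unit-variance covariance matrix with pairwise correlation rho (assumed structure)."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

n = 5
a = np.ones(n)  # equal weights
for rho in (0.0, 0.2, 0.4, 0.6):
    r = corr_with_columns(a, equicorrelated(n, rho))
    print(rho, r.min())
```

With equal weights and this structure, every Corr(y, x_i) equals sqrt((1 + (n-1)*rho) / n). For n = 5 that only exceeds 0.7 once rho is above roughly 0.36, and it falls back to 1/sqrt(n) at rho = 0.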

Chris Taylor
  • It may be worth adding that the bound for the correlation is `1/sqrt(n)`, which is the theoretical correlation of y and n completely uncorrelated x's. – MB-F Feb 05 '16 at 08:49
  • Thanks. I can't think of anything to dispute in your approach. Kudos for that, and I think I should save myself some time instead of looking for a way to crack this problem =D – skw1990 Feb 07 '16 at 18:00

I am not an expert in this field, so read with extreme prejudice!

  1. I am a bit confused by your y

    Is your y a single constant, and you want the correlation between it and all the x_i values to be > 0.7? I am no math/statistics expert, but my feeling is that this is achievable only if the correlations between the x_i,x_j pairs satisfy the same condition. In that case you can simply take the average of the x_i like this:

    y=(x_1+x_2+x_3+...+x_n)/n
    

    so a_i=1.0/n and c=0.0. But the question remains:

    What meaning does a correlation between only 2 numbers have?

  2. It would be more reasonable if y were a function of x

    for example like this:

    y(x) = a_1*(x-x_1)+... +a_n*(x-x_n) + c
    

    or any other equation (hard to make any without knowing where it came from and for what purpose). Then you can compute the correlation between two sets

    X = {   x_1 ,  x_2 ,..., x_n  }
    Y = { y(x_1),y(x_2),...y(x_n) }
    

    In that case I would try an approximation search for the c,a_i constants that maximizes the correlation between X,Y, but the complexity of searching all of them at once would be insane. So instead I would tweak just one constant at a time:

    1. set some safe c,a_1,a_2,... constants
    2. tweak a_1

      so compute the correlation for (a_1-delta) and (a_1+delta) and choose the direction that improves the correlation. Then keep going in that direction until the correlation coefficient starts to drop.

      Then you can recursively do this again with a smaller delta. Btw this is exactly what my approx class from the link above does.

    3. loop #2 through all the a_i

    4. repeat the whole loop a few times to improve precision

    Maybe you could also compute c after each run to minimize the distance between the X,Y sets.
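The one-constant-at-a-time search above can be sketched roughly as follows. This is not the approx class mentioned in the answer. It is a hypothetical adaptation that assumes the column interpretation from the comments (each x_i is a column of a matrix `X`) and uses the smallest Corr(y, x_i) as the score to maximize. The names `min_corr` and `coordinate_search`, the default step sizes and the random test data are all my own choices. The constant c is left at zero because it does not change a Pearson correlation.

```python
import numpy as np

def min_corr(a, X):
    """Smallest Pearson correlation between y = X @ a and any column of X."""
    y = X @ a
    if np.std(y) == 0.0:
        return -1.0  # constant y has no defined correlation; treat as worst case
    return min(np.corrcoef(y, X[:, i])[0, 1] for i in range(X.shape[1]))

def coordinate_search(X, delta=0.5, shrink=0.5, rounds=20, max_steps=1000):
    """Tweak one coefficient at a time, shrinking the step after every round."""
    n = X.shape[1]
    a = np.full(n, 1.0 / n)  # "safe" start: the plain average of the columns
    best = min_corr(a, X)
    for _ in range(rounds):
        for i in range(n):
            for step in (delta, -delta):
                # Keep moving a_i in this direction while the score improves.
                for _ in range(max_steps):
                    trial = a.copy()
                    trial[i] += step
                    score = min_corr(trial, X)
                    if score <= best:
                        break
                    a, best = trial, score
        delta *= shrink  # refine with a smaller step
    return a, best

# Illustrative data: three columns sharing a common component plus noise.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
X = np.column_stack([common + 0.5 * rng.normal(size=200) for _ in range(3)])
a, best = coordinate_search(X)
print(a, best)
```

A search like this can only find coefficients that meet the 0.7 target if such coefficients exist for the data at hand. Otherwise it just returns the best score it reached from the starting point.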

Spektre
  • You should think of each xi as a column of numbers (one for each row in a data set). Or similarly, think of each xi as a random variable. – Chris Taylor Feb 05 '16 at 13:43
  • @ChrisTaylor so something like: `y(i)=a1*x1(i)+a2*x2(i)+...an*xn(i)` ? Where `i` is the row or column. That actually makes much more sense, but **#1** still applies ... `corr(Xi,Xj)>0.7` so you can do the average instead ... or set a1=1.0 and all the others to zero ... – Spektre Feb 05 '16 at 13:46
  • @ChrisTaylor but your answer states more or less the same in a more detailed way ... – Spektre Feb 05 '16 at 13:54