11

I'm trying to calculate the Pearson correlation coefficient of two variables, to determine whether there is a relationship between the number of postal codes and a range of distances. In other words, I want to see whether the number of postal codes increases or decreases as the distance range changes.

I'll have one list which will count the number of postal codes within a distance range and the other list will have the actual ranges.

Is it OK to have a list that contains ranges of distances? Or would it be better to have a list like this [50, 100, 500, 1000], where each element represents the range up to that amount? So, for example, the list represents up to 50km, then from 50km to 100km, and so on.

Salvador Dali
user94628
  • 4
    @Krab Removed unnecessary information inline with SO policy, SO is a question and answer site so saying I would appreciate help is redundant, to say thanks you upvote and accept answer.. if you want more information on this read the faq and dig around on meta stackoverflow – Chris Seymour Nov 30 '12 at 16:08

4 Answers

16

Use scipy:

scipy.stats.pearsonr(x, y)

Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

Parameters :

x : 1D array

y : 1D array the same length as x

Returns :

(Pearson’s correlation coefficient, 2-tailed p-value)
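For the postal-code case in the question, a minimal sketch might look like the following. The variable names and the counts are made up for illustration (they are not from the question); each distance value stands for the upper bound of a band, as suggested in the question.

```python
from scipy.stats import pearsonr

# Hypothetical data: upper bound of each distance band in km, and the
# number of postal codes counted in that band (made-up numbers).
distance_upper_km = [50, 100, 500, 1000]
postal_code_counts = [12, 30, 95, 160]

# pearsonr pairs the two sequences element by element,
# so both must have the same length.
r, p_value = pearsonr(distance_upper_km, postal_code_counts)

print(f"r = {r:.3f}, p-value = {p_value:.3f}")
```

With only four points the p-value should be read with caution, in line with the note above about small datasets.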

MojiProg
lucasg
  • 2
    Ok, so what matters more is that both the x and y arrays are of the same length. Then you are comparing elements x[i] with element y[i]? – user94628 Nov 30 '12 at 16:43
  • 1
    yep. In your case, x should be equal to the distances considered, and y[i] should return the number of postal code at distances[i]. To see the actual computation for the Pearson : http://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python – lucasg Nov 30 '12 at 16:49
  • Cool, so x[i] could mean up to that distance? – user94628 Nov 30 '12 at 16:52
  • Yes, x[i] could mean up to that distance. If all the distances are computed from a particular starting point, then x[i] is just an area of that distance, and the corresponding y[i] would be how many postal codes are covered in that area. – Antimony Nov 14 '15 at 22:08
  • Make sure that the arrays x and y have a mean of 0. Otherwise you will get an incorrect value. – DollarAkshay Jun 14 '18 at 08:26
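On the mean-of-0 point in the last comment: the textbook Pearson formula subtracts each series' mean before multiplying, and as far as I know scipy.stats.pearsonr does the same centring internally, so the inputs should not need a mean of 0. A rough pure-Python sketch of the definition (the helper name `pearson` and the numbers are made up for illustration) shows that shifting either list leaves the result unchanged up to rounding:

```python
import math

def pearson(x, y):
    """Pearson r from the textbook definition (illustrative helper)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Centre both series; this is why a non-zero mean does not matter.
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

x = [50, 100, 500, 1000]   # hypothetical distance bounds in km
y = [12, 30, 95, 160]      # hypothetical postal-code counts

r_raw = pearson(x, y)
# Shift both series by arbitrary constants and recompute.
r_shifted = pearson([xi + 1000 for xi in x], [yi - 40 for yi in y])

print(r_raw, r_shifted)
```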
7

You can also use numpy:

numpy.corrcoef(x, y)

which would give you a correlation matrix that looks like:

[[1          correlation(x, y)]
[correlation(y, x)          1]]
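To pull the single coefficient out of that matrix, something like this should work. The data are made-up numbers for illustration, with distances given as the upper bound of each band:

```python
import numpy as np

# Hypothetical data: distance band upper bounds (km) and postal-code counts.
x = np.array([50, 100, 500, 1000])
y = np.array([12, 30, 95, 160])

corr_matrix = np.corrcoef(x, y)   # 2x2 symmetric matrix with ones on the diagonal
r = corr_matrix[0, 1]             # off-diagonal entry is correlation(x, y)

print(corr_matrix)
print(r)
```

Unlike scipy.stats.pearsonr, this gives no p-value, only the coefficients.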
Antimony
0

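Try this with pandas. Note that calling .rank() before .corr(method='pearson') gives the Pearson correlation of the ranks, which is the Spearman rank correlation; drop .rank() to get the plain Pearson coefficient of the raw values:

val = Top15[['Energy Supply per Capita', 'Citable docs per Capita']].rank().corr(method='pearson')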
Shaurya
0

The correlation() function was added to the statistics module of the Python standard library in Python 3.10. It can be used directly after importing the module:

import statistics

statistics.correlation(words, views)
Cem Önel