After running a logistic regression algorithm on a dataset (n = 100 000), I would like to get a correlation matrix of the features.
Here is a preview of my data:
scala> results.columns
res16: Array[String] = Array(label, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, SexIndex, EmbarkIndex, SexVec, EmbarkVec, features, rawPrediction, probability, prediction)
scala> val fts = results.select("features")
fts: org.apache.spark.sql.DataFrame = [features: vector]
scala> results.select("features").show(10)
+--------------------+
| features|
+--------------------+
|[1.0,1.0,19.0,1.0...|
|[1.0,1.0,19.0,3.0...|
|[1.0,1.0,22.0,0.0...|
|[1.0,1.0,24.0,0.0...|
|[1.0,1.0,30.0,0.0...|
|[1.0,1.0,31.0,0.0...|
|[1.0,1.0,31.0,1.0...|
|[1.0,1.0,36.0,1.0...|
|(8,[0,1,2,6],[1.0...|
|[1.0,1.0,46.0,1.0...|
+--------------------+
I know that in R I could get the correlation matrix with `rcorr` (from the Hmisc package):
res <- rcorr(as.matrix(my_data))
so I tried something similar with Scala:
val corrMatrix = corr(fts)
and got the following error:
<console>:64: error: overloaded method value corr with alternatives:
(columnName1: String,columnName2: String)org.apache.spark.sql.Column <and>
(column1: org.apache.spark.sql.Column,column2: org.apache.spark.sql.Column)org.apache.spark.sql.Column
cannot be applied to (org.apache.spark.sql.DataFrame)
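For context, the `corr` in `org.apache.spark.sql.functions` is the pairwise form: both overloads take two columns, which is why passing a whole DataFrame of vectors matches neither. A sketch of the two shapes I have found so far (assuming the `results` DataFrame above and Spark 2.2+; I have not verified this on my cluster yet):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.corr
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation

// Pairwise form: correlation between two numeric columns.
results.select(corr("Age", "Fare")).show()

// Whole-matrix form: ml.stat.Correlation takes the assembled
// vector column directly and returns a one-row DataFrame
// containing the correlation Matrix.
val Row(matrix: Matrix) = Correlation.corr(results, "features").head
```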
After looking into this error and reading this and this, I think I need to put these arrays into a DataFrame and then iterate over every pair of columns, computing the correlation for each pair, i.e. something like this pseudocode, where `a(i)(j)` is the entry in the i-th row and j-th column:
for (i <- 1 to n) {
  for (j <- i to n) {
    if (i == j) a(i)(j) = 1
    else {
      a(i)(j) = corr(i, j)
      a(j)(i) = a(i)(j) // symmetric matrix
    }
  }
}
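To check my understanding of that loop, here is a self-contained plain-Scala sketch of the symmetric fill, with a hand-rolled Pearson helper standing in for whatever Spark would compute (the `pearson` and `corrMatrix` names and the toy data are my own, not from any Spark API):

```scala
// Pearson correlation of two equal-length columns of doubles.
def pearson(x: Array[Double], y: Array[Double]): Double = {
  val n  = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// Fill only the upper triangle and mirror it, as in the pseudocode.
def corrMatrix(cols: Array[Array[Double]]): Array[Array[Double]] = {
  val n = cols.length
  val a = Array.ofDim[Double](n, n)
  for (i <- 0 until n; j <- i until n) {
    if (i == j) a(i)(j) = 1.0
    else {
      val c = pearson(cols(i), cols(j))
      a(i)(j) = c
      a(j)(i) = c // symmetric matrix
    }
  }
  a
}
```

On toy columns like `Array(1.0, 2.0, 3.0, 4.0)` and its double, the off-diagonal entries come out as expected for perfectly correlated data.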
I am a complete beginner in Scala and Spark, so I would really appreciate it if someone could help me out.