I am using Spark 2.2.0 with Python. I am trying to figure out the default value of the link parameter that Spark accepts in GeneralizedLinearRegression
for the Tweedie family.
According to the documentation (https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression), the Python signature is:
class pyspark.ml.regression.GeneralizedLinearRegression(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None
It seems that the default value of link should be None when family='tweedie', but when I tried the following (modeled on this unit test: https://github.com/apache/spark/pull/17146/files/fe1d3ae36314e385990f024bca94ab1e416476f2):
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
# `spark` is the active SparkSession
df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),
                            (1.0, Vectors.dense(1.0, 2.0)),
                            (2.0, Vectors.dense(0.0, 0.0)),
                            (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
# link=None is passed explicitly here
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42, link=None)
model = glr.fit(df)
transformed = model.transform(df)
it raised a Java NullPointerException:
...
Py4JJavaError: An error occurred while calling o6739.w. : java.lang.NullPointerException ...
It works fine when I remove the explicit link=None from the model initialization:
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
df = spark.createDataFrame([(1.0, Vectors.dense(0.0, 0.0)),
                            (1.0, Vectors.dense(1.0, 2.0)),
                            (2.0, Vectors.dense(0.0, 0.0)),
                            (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
# link is not passed at all here
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42)
model = glr.fit(df)
transformed = model.transform(df)
I would like to be able to pass a standard set of params like
params={"family":"Onefamily","link":"OnelinkAccordingToFamily",..}
and then initialize GLM as:
glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ...)
That way the code would be uniform and work for any combination of family and link. However, it seems that the link value is not ignored when family='tweedie'. Any idea what default value I should use? I tried link='' and link='None', but both raise 'invalid link function'.
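One possible workaround is sketched below. It assumes the failure comes only from an explicit None being forwarded to the estimator, which is what the two snippets above suggest. The idea is to drop the None entries from the params dict before unpacking it into the constructor. The helper name drop_unset and the example dict are hypothetical and not part of pyspark. The pyspark call itself is left as a comment so that the snippet runs without a Spark session.

```python
def drop_unset(params):
    """Return a copy of params without the entries whose value is None."""
    return {name: value for name, value in params.items() if value is not None}


# Hypothetical parameter set: link is left as None for the Tweedie family.
params = {"family": "tweedie", "variancePower": 1.42, "link": None}
glr_kwargs = drop_unset(params)
print(glr_kwargs)

# With pyspark available, the estimator could then be built as:
#   from pyspark.ml.regression import GeneralizedLinearRegression
#   glr = GeneralizedLinearRegression(**glr_kwargs)
```

This way the same params dict can be reused for every family: a link that is set gets passed through unchanged, and one that is None is simply omitted, which corresponds to the second snippet above that works.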