I am using Spark 2.2.0 with Python. I am trying to find out which default value of the link parameter Spark uses in GeneralizedLinearRegression for the Tweedie family.

When I look at the documentation (https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression), I see this signature:

class pyspark.ml.regression.GeneralizedLinearRegression(self, labelCol="label", featuresCol="features", predictionCol="prediction", family="gaussian", link=None, fitIntercept=True, maxIter=25, tol=1e-6, regParam=0.0, weightCol=None, solver="irls", linkPredictionCol=None

It seems that the default value of link when family='tweedie' should be None, but when I tried this (using a test similar to the unit test at https://github.com/apache/spark/pull/17146/files/fe1d3ae36314e385990f024bca94ab1e416476f2):

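from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

# `spark` is the active SparkSession (e.g. the one provided by the pyspark shell)
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
# link=None is passed explicitly here
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42, link=None)
model = glr.fit(df)
transformed = model.transform(df)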

it raised a Java NullPointerException:

Py4JJavaError: An error occurred while calling o6739.w. : java.lang.NullPointerException ...

It works well when I remove the explicit link=None from the initialization of the model:

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 0.0)),
     (1.0, Vectors.dense(1.0, 2.0)),
     (2.0, Vectors.dense(0.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])
# link is not passed at all here
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.42)
model = glr.fit(df)
transformed = model.transform(df)

I would like to be able to pass a standard set of params such as

params={"family":"Onefamily","link":"OnelinkAccordingToFamily",..}

and then initialize the GLM as:

 glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ....)

That way the call would be uniform and work for any combination of family and link. It just seems that the link value is not ignored when family="tweedie". Any idea what default value I should use? I tried link='' and link='None', but both raise 'invalid link function'.

Antoine Ly

1 Answer

For the Tweedie family in GeneralizedLinearRegression, the link is a power link function that is specified through the "linkPower" parameter rather than through "link". You shouldn't set link to None explicitly; that is what led to the exception you got.

Here is an example of how to use it:

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

df = spark.createDataFrame(
        [(1.0, Vectors.dense(0.0, 0.0)),
         (1.0, Vectors.dense(1.0, 2.0)),
         (2.0, Vectors.dense(0.0, 0.0)),
         (2.0, Vectors.dense(1.0, 1.0))], ["label", "features"])

# linkPower is not set, so the default link power applies
glr = GeneralizedLinearRegression(family="tweedie", variancePower=1.6)
model = glr.fit(df)

# set the link power explicitly (note that this modifies glr in place)
model2 = glr.setLinkPower(-1.0).fit(df)

PS : The default link power in the tweedie family is 1 - variancePower.
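
To make the PS concrete, here is a small pure-Python sketch of how the two parameters relate, as I understand the Spark documentation: variancePower is the power p in the Tweedie variance function (variance proportional to mu ** p), while linkPower is the exponent of the power link, with 0 meaning the log link. The function names below are hypothetical helpers for illustration, not part of the pyspark API.

```python
import math


def default_link_power(variance_power):
    """Link power that applies when linkPower is left unset (1 - variancePower)."""
    return 1.0 - variance_power


def power_link(mu, link_power):
    """Power link g(mu) = mu ** link_power; a link power of 0 means the log link."""
    if link_power == 0.0:
        return math.log(mu)
    return mu ** link_power


def tweedie_variance(mu, variance_power):
    """Tweedie variance function V(mu) = mu ** variance_power."""
    return mu ** variance_power
```

With variancePower=1.6 as in the example above, the default link power comes out at about -0.6. Link powers of 1, -1 and 0.5 correspond to the identity, inverse and sqrt links respectively.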

eliasah
  • Ok, so it seems I cannot use something like `glr = GeneralizedLinearRegression(family=params["family"], link=params["link"], ....)`; instead I should remove the 'link' item from my dictionary `params` and call it with `GeneralizedLinearRegression(**params)`. Thank you! – Antoine Ly Oct 25 '17 at 12:14
  • What is the difference between `variancePower` and `linkPower`? The docs are extremely obtuse here. Which one is the `p` in the Tweedie distribution? – Evan Zamir Jan 25 '22 at 20:08
  • I tried running this code and just get an error `py4j.protocol.Py4JJavaError: An error occurred while calling o93.toString. : java.util.NoSuchElementException: Failed to find a default value for link` – Evan Zamir Jan 25 '22 at 21:12
  • @EvanZamir the error you mention isn't necessarily related to the code I have provided. You have another issue with your pyspark environment: https://stackoverflow.com/questions/51952535/pyspark-error-py4jjavaerror-an-error-occurred-while-calling-o655-count-when – eliasah Jan 26 '22 at 09:16
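
Following up on the first comment, a minimal sketch of how a generic params dictionary could be cleaned before calling `GeneralizedLinearRegression(**params)`. The helper name `glr_kwargs` is hypothetical. It assumes that leaving a parameter out is the safe way to get Spark's default, and that for the Tweedie family the link should be given through linkPower rather than link.

```python
def glr_kwargs(params):
    """Return a cleaned copy of params to pass as keyword arguments."""
    # Drop entries whose value is None so Spark falls back to its own defaults
    # instead of receiving an explicit None.
    kwargs = {key: value for key, value in params.items() if value is not None}
    # Assumption: for the Tweedie family the link is given via linkPower,
    # so any "link" entry is dropped.
    if str(kwargs.get("family", "")).lower() == "tweedie":
        kwargs.pop("link", None)
    return kwargs


params = {"family": "tweedie", "link": None, "variancePower": 1.42}
cleaned = glr_kwargs(params)
# With pyspark available: glr = GeneralizedLinearRegression(**cleaned)
```

The original dictionary is left unchanged, so the same `params` can be reused for other families where a link entry is valid.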