3

I have the following code which basically is doing feature engineering pipeline:

token_q1=Tokenizer(inputCol='question1',outputCol='question1_tokens') 
token_q2=Tokenizer(inputCol='question2',outputCol='question2_tokens')  

remover_q1=StopWordsRemover(inputCol='question1_tokens',outputCol='question1_tokens_filtered')
remover_q2=StopWordsRemover(inputCol='question2_tokens',outputCol='question2_tokens_filtered')

q1w2model = Word2Vec(inputCol='question1_tokens_filtered',outputCol='q1_vectors')
q1w2model.setSeed(1)

q2w2model = Word2Vec(inputCol='question2_tokens_filtered',outputCol='q2_vectors')
q2w2model.setSeed(1)

pipeline=Pipeline(stages[token_q1,token_q2,remover_q1,remover_q2,q1w2model,q2w2model])
model=pipeline.fit(train)
result=model.transform(train)
result.show()

I want to add the following UDF to this above pipeline:

charcount_q1 = F.udf(lambda row : sum([len(char) for char in row]),IntegerType())

When I do that, I get Java error. Can someone point me to the right direction?

However, I was adding this column using the following code which basically works:

charCountq1=train.withColumn("charcountq1", charcount_q1("question1"))

But I want to add it into a pipleline rather than doing this way

Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115
CyberPunk
  • 1,297
  • 5
  • 18
  • 35

1 Answers1

7

If you want to use udf in Pipeline you'll need one of the following:

The first one is quite verbose for such a simple use case so I recommend the second option:

from pyspark.sql.functions import udf
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

charcount_q1 = spark.udf.register(
    "charcount_q1",
    lambda row : sum(len(char) for char in row),
    "integer"
)

df = spark.createDataFrame(
    [(1, ["spark", "java", "python"])],
    ("id", "question1"))

pipeline = Pipeline(stages = [SQLTransformer(
    statement = "SELECT *, charcount_q1(question1) charcountq1 FROM __THIS__"
)])

pipeline.fit(df).transform(df).show()
# +---+--------------------+-----------+
# | id|           question1|charcountq1|
# +---+--------------------+-----------+
# |  1|[spark, java, pyt...|         15|
# +---+--------------------+-----------+
Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115