I have the following code which basically is doing feature engineering pipeline:
token_q1=Tokenizer(inputCol='question1',outputCol='question1_tokens')
token_q2=Tokenizer(inputCol='question2',outputCol='question2_tokens')
remover_q1=StopWordsRemover(inputCol='question1_tokens',outputCol='question1_tokens_filtered')
remover_q2=StopWordsRemover(inputCol='question2_tokens',outputCol='question2_tokens_filtered')
q1w2model = Word2Vec(inputCol='question1_tokens_filtered',outputCol='q1_vectors')
q1w2model.setSeed(1)
q2w2model = Word2Vec(inputCol='question2_tokens_filtered',outputCol='q2_vectors')
q2w2model.setSeed(1)
pipeline=Pipeline(stages[token_q1,token_q2,remover_q1,remover_q2,q1w2model,q2w2model])
model=pipeline.fit(train)
result=model.transform(train)
result.show()
I want to add the following UDF to this above pipeline:
charcount_q1 = F.udf(lambda row : sum([len(char) for char in row]),IntegerType())
When I do that, I get Java error. Can someone point me to the right direction?
However, I was adding this column using the following code which basically works:
charCountq1=train.withColumn("charcountq1", charcount_q1("question1"))
But I want to add it into a pipleline rather than doing this way