I have a DataFrame with a text column CALL_TRANSCRIPT (string) and a column pii_allmethods (array of strings). I am trying to search each CALL_TRANSCRIPT for the strings in the array and mask them using a PySpark pandas UDF, but I get an error saying the UDF outputted more rows than input rows. I have tried a couple of ways to troubleshoot it but have not been able to resolve it.
The inner for loop goes through the pii_list array and replaces each match in the call transcript (the text variable) with a mask value. The yield comes after the inner loop has finished, so it is not clear to me why the UDF would return more rows than it received.
NOTE: I already have a regular Spark UDF that works; I am trying a pandas UDF to improve performance.
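The per-row logic described above can be sketched as a plain Python function. This is only an illustration: the name `mask_pii` and the sample inputs are assumptions, not taken from the working Spark UDF, which is not shown here.

```python
import re
from typing import Iterable


def mask_pii(text: str, pii_list: Iterable[str]) -> str:
    """Replace each occurrence of a PII string with len(pii) 'X' characters."""
    # Longest strings first, so a short entry never splits a longer one
    # that contains it before the longer one has been masked.
    for pii in sorted(pii_list, key=len, reverse=True):
        if len(pii) > 1:  # skip empty and single-character entries, as in the post
            text = re.sub(re.escape(pii), "X" * len(pii), text, flags=re.IGNORECASE)
    return text


# Hypothetical sample row: one transcript and its list of PII strings.
print(mask_pii("Call John Smith at 555-1234", ["john", "John Smith", "555-1234"]))
# prints: Call XXXXXXXXXX at XXXXXXXX
```

The function takes one string and one list and returns one string, which is the shape a row-at-a-time UDF works with.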

    dfs = dfs.withColumn(
        'FULL_TRANSCRIPT',
        pu_mask_all_pii(col("CALL_TRANSCRIPT"), col("pii_allmethods")),
    )

**Python UDF function:**

    import re
    from typing import Iterator, Tuple

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def pu_mask_all_pii(iterator: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
        for text, pii_list in iterator:
            pii_list = sorted(pii_list, key=len, reverse=True)
            strtext = str(text)
            for pii in pii_list:
                if len(pii) > 1:
                    mask = len(pii) * 'X'
                    strtext = str(re.sub(re.escape(pii), mask, strtext.encode(), flags=re.IGNORECASE))
            yield strtext

**PythonException:** An exception was thrown from a UDF: 'AssertionError: Pandas SCALAR_ITER UDF outputted more rows than input rows.'. Full traceback below:
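One possible reading of this assertion, offered as a guess rather than a confirmed diagnosis: in the `Iterator[Tuple[pd.Series, pd.Series]]` form, each loop iteration receives a whole batch (a pair of `pd.Series`), not a single row. `str(text)` is then the printed form of the entire batch, and the length of the yielded string may be what gets counted as output rows. A minimal sketch under that assumption yields one `pd.Series` per batch with the same length as the input. The name `mask_batches` and the sample data are hypothetical. The sketch is plain pandas so it runs without a cluster, and it assumes no null transcripts or null PII arrays.

```python
import re
from typing import Iterator, Tuple

import pandas as pd


def mask_batches(
    batches: Iterator[Tuple[pd.Series, pd.Series]]
) -> Iterator[pd.Series]:
    """Yield one masked Series per input batch, same length as the input."""
    for texts, pii_lists in batches:
        masked = []
        # Walk the batch row by row: one transcript plus its PII list.
        for text, pii_list in zip(texts, pii_lists):
            text = str(text)
            for pii in sorted(pii_list, key=len, reverse=True):
                if len(pii) > 1:
                    # Pattern, replacement and text are all str (no .encode()),
                    # since re.sub raises TypeError when str and bytes are mixed.
                    text = re.sub(re.escape(pii), "X" * len(pii), text,
                                  flags=re.IGNORECASE)
            masked.append(text)
        # Reuse the input index so output rows line up with input rows.
        yield pd.Series(masked, index=texts.index, dtype="object")


# Hypothetical two-row batch standing in for what Spark would pass in.
texts = pd.Series(["Call John at 555-1234", "No PII here"])
pii = pd.Series([["John", "555-1234"], []])

# str() of a whole Series is its printed form, much longer than two rows.
print(len(str(texts)) > len(texts))  # prints: True

out = list(mask_batches(iter([(texts, pii)])))
print(out[0].tolist())  # prints: ['Call XXXX at XXXXXXXX', 'No PII here']
```

If this reading is right, the same generator body could sit under `@pandas_udf("string")` with the original type hints; that combination has not been run against Spark here.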