
I have a Spark DataFrame (df) consisting of the features my model was trained on, plus 3 extra columns, ID, label and status (these 3 are not used in model training). I'm using a model registered in MLflow to predict on these columns, but before sending df for prediction I drop the 3 columns ID, label and status and store the result in an intermediate DataFrame.

After passing df through the registered MLflow model, I get a new column (Probability) in result_df holding my probabilities (this is a binary classification model).

Now I want to add these probabilities back to the original df without changing the row order, i.e. each ID, label and status should be assigned its corresponding predicted probability.

My final table should have ID, label, status and the probability. NOTE: df and result_df don't have any common column to join on, so I've used monotonically_increasing_id() to perform the join.

I've tried the code below:

from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn('index', monotonically_increasing_id())
result_df = result_df.withColumn('index', monotonically_increasing_id())

final_df = df.join(result_df, 'index', 'left')

When I do the left join, the resulting DataFrame final_df has nulls in the ID column.

And df.select('ID','label','status').exceptAll(final_df.select('ID','label','status')) returns a lot of differing rows.

But when I do a left outer join:

final_df = df.join(result_df, 'index', 'left_outer')

then df.select('ID','label','status').exceptAll(final_df.select('ID','label','status')) looks fine, but there are null values in the Probability column.

Please let me know where I am going wrong.

anaktha
  • Monotonically increasing ID won't work, as the two DataFrames will get different IDs; see why here: https://stackoverflow.com/questions/48209667/using-monotonically-increasing-id-for-assigning-row-number-to-pyspark-datafram – Ronak Jain May 05 '23 at 19:47

0 Answers