I have a dataframe and, after adding a rank column, I split it into several dataframes, one per distinct rank:
rankedDF:
job_id | task_id | rating | proba | rank |
---|---|---|---|---|
1 | 111 | 1 | 0.7 | 1 |
1 | 111 | 2 | 0.3 | 1 |
1 | 122 | 4 | 0.9 | 2 |
1 | 122 | 7 | 0.1 | 2 |
1 | 133 | 3 | 0.6 | 3 |
1 | 133 | 1 | 0.4 | 3 |
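
To make the example reproducible, here is a minimal sketch that builds `rankedDF` directly from the rows in the table above. It assumes a local `SparkSession` (the `master` and `appName` values are placeholders), and it hard-codes the `rank` column instead of computing it, since the ranking step is not shown here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session is enough for this small example.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rank-combination-example")
  .getOrCreate()
import spark.implicits._

// Rows copied from the rankedDF table; rank is hard-coded rather than computed.
val rankedDF = Seq(
  (1, 111, 1, 0.7, 1),
  (1, 111, 2, 0.3, 1),
  (1, 122, 4, 0.9, 2),
  (1, 122, 7, 0.1, 2),
  (1, 133, 3, 0.6, 3),
  (1, 133, 1, 0.4, 3)
).toDF("job_id", "task_id", "rating", "proba", "rank")
```
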
To create multiple dataframes:
val numberRanks = rankedDF.select("rank").distinct().count().toInt
// create one dataframe per rank
val rankDFs = for (i <- 1 to numberRanks) yield {
  rankedDF.filter(col("rank") === i)
}
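
The loop above relies on the ranks being the contiguous integers 1 to `numberRanks`. As a hedged alternative, the sketch below collects the rank values that are actually present and filters on those. `splitByRank` is a hypothetical helper name, and it assumes the `rank` column has an integer type.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: one dataframe per distinct rank value, in ascending rank order.
// Assumption: the "rank" column is an IntegerType column.
def splitByRank(df: DataFrame): Seq[DataFrame] = {
  val ranks = df.select("rank").distinct().collect().map(_.getInt(0)).sorted
  ranks.toSeq.map(r => df.filter(col("rank") === r))
}
```

With the example data, `splitByRank(rankedDF)` would return three dataframes, for ranks 1, 2 and 3.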
Then I join the dataframes with each other and build arrays that combine the rating values and the task_id values across them, and I multiply the proba values together:
// join the dataframes with each other
val joinedDFs = rankDFs.reduce((df1, df2) =>
  df1.join(df2, Seq("job_id"))
    .withColumn("combination_ratings", array(col("rating"), col("rating")))
    .withColumn("combination_task", array(col("task_id"), col("task_id")))
    .withColumn("final_proba", col("proba") * col("proba"))
).select("job_id", "combination_task", "combination_ratings", "final_proba")
The intermediate result, just after the join and before the combination arrays are created, is:
job_id | task_id | rating | proba | task_id | rating | proba | task_id | rating | proba |
---|---|---|---|---|---|---|---|---|---|
1 | 111 | 1 | 0.1 | 122 | 3 | 0.7 | 133 | 3 | 0.6 |
1 | 111 | 2 | 0.3 | 122 | 4 | 0.4 | 133 | 1 | 0.2 |
After the combination, the result should be something like this:
job_id | combination_task | combination_ratings | final_proba |
---|---|---|---|
1 | [111, 122, 133] | [1, 4, 3] | 0.378 |
1 | [111, 122, 133] | [2, 7, 1] | 0.012 |
But I get this error:
reference 'rating' is ambiguous, could be: rating, rating
PS: I also tried aliasing the dataframes in the join expression, but the error was the same.
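
For context on the error: after joining on `job_id`, both sides still have columns named `rating`, `task_id` and `proba`, so `col("rating")` cannot be resolved to a single column. One direction that might avoid this, sketched here under assumptions and not as a confirmed fix, is to give each per-rank dataframe unique column names before joining. `withRankSuffix` and `combine` are hypothetical helper names, and the `_1`, `_2`, ... suffix scheme is an illustrative choice.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col}

// Hypothetical helper: drop "rank" and suffix every column except job_id with
// the position index, so that names stay unique after the join.
def withRankSuffix(df: DataFrame, i: Int): DataFrame = {
  val base = df.drop("rank")
  base.columns.filter(_ != "job_id").foldLeft(base) { (acc, c) =>
    acc.withColumnRenamed(c, s"${c}_$i")
  }
}

// Hypothetical helper: join the renamed dataframes on job_id, then build the
// combination arrays and the product of the probabilities.
def combine(rankDFs: Seq[DataFrame]): DataFrame = {
  val n = rankDFs.size
  val joined = rankDFs.zipWithIndex
    .map { case (df, idx) => withRankSuffix(df, idx + 1) }
    .reduce((left, right) => left.join(right, Seq("job_id")))

  joined
    .withColumn("combination_task", array((1 to n).map(i => col(s"task_id_$i")): _*))
    .withColumn("combination_ratings", array((1 to n).map(i => col(s"rating_$i")): _*))
    .withColumn("final_proba", (1 to n).map(i => col(s"proba_$i")).reduce(_ * _))
    .select("job_id", "combination_task", "combination_ratings", "final_proba")
}
```

Note that a plain join on `job_id` pairs every row of one rank with every row of the other ranks for that job, so with two rows per rank and three ranks this sketch would produce 8 rows for job 1, not only the 2 rows shown in the expected result.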