My question is not a duplicate of *Joining Spark Dataframes with "isin" operator*: my question is about "not in", not "is in". It is DIFFERENT!
I have two `Dataset`s:

- `userProfileDataset: Dataset[UserProfile]`
- `jobModelsDataset: Dataset[JobModel]`
Case class `UserProfile` is defined as

```scala
case class UserProfile(userId: Int, visitedJobIds: Array[Int])
```

and case class `JobModel` is defined as

```scala
case class JobModel(JobId: Int, Model: Map[String, Double])
```
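To make things concrete, here is a tiny invented sample (the values are made up purely for illustration; `spark` is an existing `SparkSession`):

```scala
import spark.implicits._

val userProfileDataset = Seq(
  UserProfile(1, Array(10, 20)),
  UserProfile(2, Array(20))
).toDS()

val jobModelsDataset = Seq(
  JobModel(10, Map("weight" -> 0.5)),
  JobModel(20, Map("weight" -> 0.1)),
  JobModel(30, Map("weight" -> 0.9))
).toDS()
```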
I have also made two objects (`UserProfileFieldNames` and `JobModelFieldNames`) that contain the field names of these two case classes.
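They are essentially just string constants, sketched here for completeness (reconstructed from the case classes above, so treat the exact definitions as approximate):

```scala
object UserProfileFieldNames {
  val userId = "userId"
  val visitedJobIds = "visitedJobIds"
}

object JobModelFieldNames {
  val jobId = "JobId"
  val model = "Model"
}
```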
My objective is, for each user in `userProfileDataset`, to find the `JobModel.JobId`s that are NOT contained in that user's `UserProfile.visitedJobIds`. For instance, with the sample above, user 1 should get JobId 30, and user 2 should get JobIds 10 and 30.

How can I do this?
I've thought about using a `crossJoin` and then a `filter`. It may work, but are there more direct or efficient ways?
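For reference, the `crossJoin`-then-`filter` version I have in mind would look roughly like this (an untested sketch; I am assuming `array_contains` from `org.apache.spark.sql.functions` is the right predicate for an array column):

```scala
import org.apache.spark.sql.functions.array_contains

// Pair every user with every job, then keep only the rows where the job's
// JobId does NOT appear in that user's visitedJobIds array.
val crossed = userProfileDataset.crossJoin(jobModelsDataset)
val notVisited = crossed.filter(
  !array_contains(
    crossed(UserProfileFieldNames.visitedJobIds),
    crossed(JobModelFieldNames.jobId)
  )
)
```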
I have tried the following approaches, but none of them worked:
```scala
val result = userProfileDataset.joinWith(jobModelsDataset,
  !userProfileDataset.col(UserProfileFieldNames.visitedJobIds)
    .contains(jobModelsDataset.col(JobModelFieldNames.jobId)),
  "left_outer"
)
```
It leads to:
```
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
'contains(_1.visitedJobIds, CAST(_2.JobId AS STRING))' due to data type mismatch:
argument 1 requires string type, however, '_1.visitedJobIds' is of array type.;;
```
Could it be because the `contains` method can only be used to test whether one string contains another?
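If so, I suppose the array analogue would be `array_contains` from `org.apache.spark.sql.functions`, so the condition I should try instead might be (unverified):

```scala
import org.apache.spark.sql.functions.array_contains

// Same joinWith as above, but testing array membership rather than
// string containment.
val result = userProfileDataset.joinWith(jobModelsDataset,
  !array_contains(
    userProfileDataset.col(UserProfileFieldNames.visitedJobIds),
    jobModelsDataset.col(JobModelFieldNames.jobId)
  ),
  "left_outer"
)
```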
The following condition also didn't work:
```scala
!jobModelsDataset.col(JobModelFieldNames.jobId)
  .isin(userProfileDataset.col(UserProfileFieldNames.visitedJobIds))
```
It leads to:
```
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
'(_2.JobId IN (_1.visitedJobIds))' due to data type mismatch: Arguments must be same
type but were: IntegerType != ArrayType(IntegerType,false);;
'Join LeftOuter, NOT _2#74.JobId IN (_1#73.visitedJobIds)
```