2

I am trying to build a query to match two columns and I have tried the following:

obj = obj.filter(e => e.colOne.exactMatch(e.colTwo))

I am not able to get this working. Is there any way to filter by comparing the contents of two columns?

2 Answers

0

The filter() method can't read a value from each object and compare against it; it can only filter against a static value.

You can filter a smaller object set (<100K rows) named myUnfilteredObjects of type ObjectType this way:

let myFilteredObjects = new Set<ObjectType>();

for (const unfilteredObj of myUnfilteredObjects) {
    if (unfilteredObj.colOne === unfilteredObj.colTwo) {
        myFilteredObjects.add(unfilteredObj);
    }
}
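The loop above needs something it can iterate over, such as an array of objects that are already loaded into memory. As a rough sketch under that assumption, the same comparison can be wrapped in a small helper. The `Row` interface, the `filterMatchingColumns` name, and the sample data are hypothetical, made up here for illustration; how the objects get loaded from an object set is not shown.

```typescript
// Hypothetical shape: any object exposing the two columns to compare.
interface Row {
    colOne: string | undefined;
    colTwo: string | undefined;
}

// Keep only the rows whose colOne strictly equals colTwo.
// Runs entirely in memory, so it suits smaller collections only.
// Note: two undefined values also compare as equal under ===.
function filterMatchingColumns<T extends Row>(rows: readonly T[]): T[] {
    const matches: T[] = [];
    for (const row of rows) {
        if (row.colOne === row.colTwo) {
            matches.push(row);
        }
    }
    return matches;
}

// Usage with made-up sample data.
const sample: Row[] = [
    { colOne: "a", colTwo: "a" },
    { colOne: "a", colTwo: "b" },
    { colOne: "c", colTwo: "c" },
];
const matched = filterMatchingColumns(sample);
console.log(matched.length);
```

Returning an array rather than a Set keeps the original order and makes the result easy to pass on, at the cost of not deduplicating.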

Edit: updating with a solution for larger-scale object sets:

You can create a new boolean column in your object's underlying dataset that is true if colOne and colTwo match, and false otherwise. Filtering on this new column via the filter() method will then work as you expect.

Adil B
  • Just be cognizant that this approach will require loading all the objects into the function executor memory, so it will most likely only work on smaller scales. I'd guess somewhere in the 10k - 100k scale depending on the size of each object. Your other option would be - assuming that the relevant columns _are not_ getting edited - to add a new boolean property that's true when they match and false otherwise, which would make for quick filtering. – Logan Rhyne Feb 11 '22 at 14:12
  • Actually, I am building the query and working with an ObjectSet. My code begins with `obj = Objects.search().test()`. When I apply your solution, I get this error: Type 'ObjectSet' must have a '[Symbol.iterator]()' method that returns an iterator. – houssem gharsalli Feb 11 '22 at 14:13
  • The `filter()` method can't dynamically grab the value to filter based on each object, but can be used to filter on a static value. I recommend Logan's second idea above, which is to create a new boolean column in your object's underlying dataset that is `true` if `colOne` and `colTwo` match, and `false` otherwise. Filtering on that column via the `filter()` method will then work as you expect – Adil B Feb 11 '22 at 14:21
  • Thank you both. I am going to try the second solution from Logan, since my query is very large. – houssem gharsalli Feb 11 '22 at 14:24
0

It is not possible to compare two columns when writing Functions. A recommended strategy here is to create a new column that captures the equality. For example, in your PySpark pipeline, right before you generate the end objects that get indexed:

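from pyspark.sql import functions as F

# True where colOne equals colTwo, False otherwise (including nulls)
df = df.withColumn("colOneEqualsColTwo", F.when(
     F.col("colOne") == F.col("colTwo"), True
).otherwise(False))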

And then filter on that new column:

obj = obj.filter(e => e.colOneEqualsColTwo.exactMatch(true))
fmsf