How to use AND with ANY in NSPredicates while staying efficient

Question

Imagine I have a collection of books in Core Data. Each book can have multiple authors. Here's what this imaginary collection of books could look like (it only has one book, just to simplify things):

[
    Book(authors: [
        Author(first: "John", last: "Monarch"),
        Author(first: "Sarah", last: "Monarch")
    ])
]

I want to filter my collection of books down to only those that have an author whose name is "Sarah Monarch".

From what I've seen so far, if I wanted to write an NSPredicate to filter my collection and return a filtered collection that only contains this book, I could do it by using a SUBQUERY:

NSPredicate(format: "SUBQUERY(authors, $author, $author.#first == 'Sarah' && $author.#last == 'Monarch').@count > 0")

My understanding is that this operation is essentially the same as:

books.filter {
    book in
    
    let matchingAuthors = book.authors.filter {
        author in
        
        author.first == "John" && author.last == "Monarch"
    }
    
    return matchingAuthors.count > 0
}

My problem is that it seems there's some inefficiency here — SUBQUERY (and the example code above) will look at all authors, when we can stop after finding just one that matches. My intuition would lead me to try a predicate like:

ANY (authors.#first == "Sarah" && authors.#last == "Monarch")

Which, as code, could be:

books.filter {
    book in
    
    return book.authors.contains {
        author in
        
        author.first == "John" && author.last == "Monarch"
    }
}

But this predicate's syntax isn't valid.

If I'm right, and the SUBQUERY-based approach is less efficient (because it looks at all elements in a collection, rather than just stopping at the first match) is there a more correct and efficient way to do this?

@Willeke This is a good question! The main reason is that when dealing with Core Data, my understanding is that predicates are *usually* faster and more efficient than loading the objects into memory and filtering over them, because predicates can operate directly on the underlying storage (like an SQL file). There are other reasons too, though — mattsven, Jul 09 '21 at 16:38
@JoakimDanielson Yeah that's definitely an option (assuming I have the kind of relationship you're suggesting set up), but I'm still curious to know if there's a way to do it by filtering directly on Book entities. — mattsven, Jul 09 '21 at 16:59
Do you want to filter an arrary or do you want to use the predicate in a fetch? — Willeke, Jul 10 '21 at 07:34
I wouldn't worry about any inefficiency in using SUBQUERY, unless and until you actually have a real performance issue. Stick with the first predicate you constructed. In practice the time consumed in scanning all the authors for a given book, v stopping at the first match, might be inconsequential. There are other ways to do it, but whether they are more or less efficient will depend on lots of factors. — pbasdf, Jul 10 '21 at 12:39
@pbasdf That's a good reminder — I should actually profile to see what the performance penalties of `SUBQUERY` actually are. But I'm still curious to know if there's another way to do this just in case so I have an alternative as a backup, especially since in my case, there can be many authors (I'm talking thousands or more) — mattsven, Jul 10 '21 at 14:03
@mattsven I’ll post an alternative later. Do you have the inverse relationship set up? Is it one-many or many-many? — pbasdf, Jul 10 '21 at 15:26
@pbasdf This SO question is partly hypothetical (I'm writing a library to make building predicates easier) so the answer to your question is... maybe? Ideally yes, an inverse relationship will exist, but I am curious if that's my only option, or if there's a way to do this with predicates that I'm not aware of — mattsven, Jul 10 '21 at 18:17
It depends on how you are going to use the predicate. One other way is to fetch author `"Sarah Monarch" and predicate `"ANY authors == %@", sarah_monarch_author`. Add some requirements to the question please, how universal is the predicate? SQL compatible? — Willeke, Jul 11 '21 at 08:57

score 2 · Accepted Answer · answered Jul 11 '21 at 12:05

The following is an alternative approach, which does not use SUBQUERY, but should ultimately have the same results. I've no idea whether this would in practice be more or less efficient than using SUBQUERY.

Your specific concern is the inefficiency of counting all the matching Authors, rather than just stopping when the first matching Author is found. Bear in mind that there is a lot going on behind the scenes whenever a fetch request is processed. First, CoreData has to parse your predicate and turn it into an equivalent SQLite query, which will look something like this:

SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 WHERE (SELECT COUNT(t1.Z_PK) FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND  t1.ZLAST == 'Monarch')) ) > 0

(The precise query will depend on whether the inverse relationship is defined and if so whether it is one-many or many-many; the above is based on the relationship having a to-one inverse, book).

When SQLite is handed that query to execute, it will check what indexes it has available and then invoke the query planner to determine how best to process it. The bit of interest is the subselect to get the count:

SELECT COUNT(t1.Z_PK) FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND  t1.ZLAST == 'Monarch'))

This is the bit of the query which corresponds to your first code snippet. Note that it is in SQLite terms a correlated subquery: it includes a parameter (t0.Z_PK - this is essentially the relevant Book) from the outer SELECT statement. SQLite will search the entire table of Authors, looking first to see whether they are related to that Book and then to see whether the author first and last names match. Your proposition is that this is inefficient; the nested select can stop as soon as any matching Author is found. In SQLite terms, that would correspond to a query like this:

SELECT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 WHERE EXISTS(SELECT 1 FROM ZAUTHOR t1 WHERE (t0.Z_PK = t1.ZBOOK AND ( t1.ZFIRST == 'Sarah' AND  t1.ZLAST == 'Monarch')) )

(It's unclear from the SQLite docs whether the EXISTS operator actually shortcuts the underlying subselect, but this answer elsewhere on SO suggests it does. If not it might be necessary to add "LIMIT 1" to the subselect to get it to stop after one row is returned). The problem is, I don't know of any way to craft a CoreData predicate which would be converted into a SQLite query using the EXISTS operator. Certainly it's not listed as a function in the NSExpression documentation, nor is it mentioned (or listed as a reserved word) in the predicate format documentation. Likewise, I don't know of any way to add "LIMIT 1" to the subquery (though it's relatively straightforward to add to the main fetch request using fetchLimit.

So not much scope for addressing the issue you identify. However, there might be other inefficiencies. Scanning the Author table for each Book in turn (the correlated subquery) might be one. Might it be more efficient to scan the Author table once, identify those that meet the relevant criteria (first = "Sarah" and last = "Monarch"), then use that (presumably much shorter) list to search for the books? As I said at the start, that's an open question: I have no idea whether it is or isn't more efficient.

To pass results of one fetch request to another, use NSFetchRequestExpression. It's a bit arcane but hopefully the following code is clear enough:

    let authorFetch = Author.fetchRequest()
    authorFetch.predicate = NSPredicate(format: "#first == 'Sarah' AND #last == 'Monarch'")
    authorFetch.resultType = .managedObjectIDResultType
    let contextExp = NSExpression(forConstantValue: self.managedObjectContext)
    let fetchExp = NSExpression(forConstantValue: authorFetch)
    let fre = NSFetchRequestExpression.expression(forFetch: fetchExp, context: contextExp, countOnly: false)
    let bookFetch = Book.fetchRequest()
    bookFetch.predicate = NSPredicate(format: "ANY authors IN %@", fre)
    let results = try! self.managedObjectContext!.fetch(bookFetch)

The result of using that fetch is a SQL query like this:

SELECT DISTINCT 0, t0.Z_PK, t0.Z_OPT, t0.ZNAME, ... FROM ZBOOK t0 JOIN ZAUTHOR t1 ON t0.Z_PK = t1.ZBOOK WHERE  t1.Z_PK IN (SELECT n1_t0.Z_PK FROM ZAUHTOR n1_t0 WHERE ( n1_t0.ZFIST == 'Sarah' AND  n1_t0.ZLAST == 'Monarch')

This has its own complexities (a DISTINCT, a JOIN, and a subselect) but importantly the subselect is no longer correlated: it is independent of the outer SELECT so can be evaluated once rather than being re-evaluated for each row of the outer SELECT.

This is great; thank you for going into detail about the underlying aspects. — mattsven, Jul 19 '21 at 00:10

How to use AND with ANY in NSPredicates while staying efficient

1 Answers1