Prolog to SQL: Any way to improve SQL code for unit tests and fix an edge case elegantly?

Question

Inspired by this StackOverflow question:

Find mutual element in different facts in swi-prolog

We have the following

Problem statement

Given a database of "actors starring in movies" (starsin is the relation linking actor "bob" to movie "a" for example)
starsin(a,bob).
starsin(c,bob).

starsin(a,maria).
starsin(b,maria).
starsin(c,maria).

starsin(a,george).
starsin(b,george).
starsin(c,george).
starsin(d,george).
And given set of movies M, find those actors that starred in all the movies of M.

The question was initially for Prolog.

Prolog solution

In Prolog, an elegant solution involves the predicate setof/3, which collects possible variable instantiations into a set (which is really list without duplicate values):

actors_appearing_in_movies(MovIn,ActOut) :-
    setof(
        Ax,
        MovAx^(setof(Mx,starsin(Mx,Ax),MovAx), subset(MovIn,MovAx)),
        ActOut
    ).

I won't go into details about this, but let's look at the test code, which is of interest here. Here are five test cases:

actors_appearing_in_movies([],ActOut),permutation([bob, george, maria],ActOut),!. 
actors_appearing_in_movies([a],ActOut),permutation([bob, george, maria],ActOut),!.
actors_appearing_in_movies([a,b],ActOut),permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c],ActOut),permutation([george, maria],ActOut),!.
actors_appearing_in_movies([a,b,c,d],ActOut),permutation([george],ActOut),!.

A test is a call to the predicate actors_appearing_in_movies/2, which is given the input list of movies (e.g. [a,b]) and which captures the resulting list of actors in ActOut.

Subsequently, we just need to test whether ActOut is a permutation of the expected set of actors, hence for example:

permutation([george, maria],ActOut)`

"Is ActOut a list that is a permutation of the list [george,maria]?.

If that call succeeds (think, doesn't return with false), the test passes.

The terminal ! is the cut operator and is used to tell the Prolog engine to not reattempt to find more solutions, because we are good at that point.

Note that for the empty set of movies, we get all the actors. This is arguably correct: every actors stars in all the movies of the empty set (Vacuous Truth).

Now in SQL.

This problem is squarely in the domain of relational algebra, and there is SQL, so let's have a go at this. Here, i'm using MySQL.

First, set up the facts.

DROP TABLE IF EXISTS starsin;

CREATE TABLE starsin (movie CHAR(20) NOT NULL, actor CHAR(20) NOT NULL);

INSERT INTO starsin VALUES
   ( "a" , "bob" ),
   ( "c" , "bob" ),
   ( "a" , "maria" ),
   ( "b" , "maria" ),
   ( "c" , "maria" ),
   ( "a" , "george" ),
   ( "b" , "george" ),
   ( "c" , "george" ),
   ( "d",  "george" );

Regarding the set of movies given as input, giving them in the form of a (temporary) table sounds natural. In MySQL, "temporary tables" are local to the session. Good.

DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b");

Approach:

The results can now be obtained by getting, for each actor, the intersection of the set of movies denoted by movies_in and the set of movies in which an actor ever appeared (created for each actor via the inner join), then counting (for each actor) whether the resulting set has at least as many entries as the set movies_in.

Wrap the query into a procedure for practical reasons. A delimiter is useful here:

DELIMITER $$

DROP PROCEDURE IF EXISTS actors_appearing_in_movies;

CREATE PROCEDURE actors_appearing_in_movies()
BEGIN

SELECT 
     d.actor 
   FROM 
     starsin d, movies_in q
   WHERE 
     d.movie = q.movie 
   GROUP BY 
     actor 
   HAVING 
     COUNT(*) >= (SELECT COUNT(*) FROM movies_in);

END$$

DELIMITER ;

Run it!

Problem A appears:

Is there a better way than edit + copy-paste table creation code, issue a CALL and check the results "by hand"?

DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
CALL actors_appearing_in_movies();

Empty set!

Problem B appears:

The above is not desired, I want "all actors", same as for the Prolog solution. As I do not want to tack a weird edge case exception onto the code, my approach must be wrong. Is there one which naturally covers this case but doesn't become too complex? T-SQL and PostgreSQL one-liners are fine too!

The other test cases yield expected data:

DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b");
CALL actors_appearing_in_movies();
+--------+
| actor  |
+--------+
| george |
| maria  |
+--------+

DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b"), ("c");
CALL actors_appearing_in_movies();
+--------+
| actor  |
+--------+
| george |
| maria  |
+--------+

DROP TABLE IF EXISTS movies_in;
CREATE TEMPORARY TABLE movies_in (movie CHAR(20) NOT NULL);
INSERT INTO movies_in VALUES ("a"), ("b"), ("c"), ("d");
CALL actors_appearing_in_movies();
+--------+
| actor  |
+--------+
| george |
+--------+

@GuyCoder it's already well advanced. But better keep going before I lose interest again. — David Tonhofer, Mar 09 '20 at 16:32
Maybe what you seek is better done using a No-SQL database. I prefer to use Neo4j. Also I have not yet used Prolog with Neo4j but it is on the list. — Guy Coder, Mar 09 '20 at 16:37
@GuyCoder No way, this is squarely the use case for relational algebra. Relations _are_ facts collections, which is why Datalog (and recursive Datalog) exist. I suppose that if you special needs regarding link databases that can be more efficiently handled by dropping certain features of RDBMS (which can maintain link data just fine), you *could* consider something specialized. Tons of this at the JOOQ blog: https://blog.jooq.org/tag/nosql/ — David Tonhofer, Mar 09 '20 at 23:39

score 1 · Accepted Answer · edited Mar 11 '20 at 14:07

And given set of movies M, find those actors that starred in all the movies of M.

I would use:

select si.actor
from starsin si
where si.movie in (<M>)
group by si.actor
having count(*) = <n>;

If you have to deal with an empty set, then you need a left join:

select a.actor
from actors a left join
     starsin si
     on a.actor = si.actor and si.movie in (<M>)
group by a.actor
having count(si.movie) = <n>;

<n> here is the number of movies in <M>.

Update: The second approach in extended form

create or replace temporary table 
   actor (actor char(20) primary key)
   as select distinct actor from starsin;

select 
   a.actor,
   si.actor,si.movie  -- left in for docu
from 
   actor a left join starsin si
     on a.actor = si.actor 
        and si.movie in (select * from movies_in)
group 
   by a.actor
having
   count(si.movie) = (select count(*) from movies_in);

Then for empty movies_in:

+--------+-------+-------+
| actor  | actor | movie |
+--------+-------+-------+
| bob    | NULL  | NULL  |
| george | NULL  | NULL  |
| maria  | NULL  | NULL  |
+--------+-------+-------+

and for this movies_in for example:

+-------+
| movie |
+-------+
| a     |
| b     |
+-------+

movie here is the top of the group:

+--------+--------+-------+
| actor  | actor  | movie |
+--------+--------+-------+
| george | george | a     |
| maria  | maria  | a     |
+--------+--------+-------+

Thanks Gordon, but I have finally managed an answer that pleases me. — David Tonhofer, Mar 11 '20 at 13:07
@DavidTonhofer . . . I don't understand your comment. Doesn't this elegantly do what you want? — Gordon Linoff, Mar 11 '20 at 13:21
You are right. The first one doesn't work in the edge case of "no movies" but the second one does actually. Yesterday I tested it and it didn't. I see why it works because it's the same as my multi-step solutions, but in one query. I will add an a few lines to the above. — David Tonhofer, Mar 11 '20 at 13:50

David Tonhofer · Answer 2 · 2020-03-11T14:08:47.273

The following solution involves counting and an UPDATE

Writeup here: A Simple Relational Database Operation

We are using MariaDB/MySQL SQL. T-SQL or PL/SQL are more complete.

Note that SQL has no vector data types that can be passed to procedures. Gotta work without that.

Enter facts as table:

CREATE OR REPLACE TABLE starsin 
   (movie CHAR(20) NOT NULL, actor CHAR(20) NOT NULL, 
    PRIMARY KEY (movie, actor));

INSERT INTO starsin VALUES
   ( "a" , "bob" ),
   ( "c" , "bob" ),
   ( "a" , "maria" ),
   ( "b" , "maria" ),
   ( "c" , "maria" ),
   ( "a" , "george" ),
   ( "b" , "george" ),
   ( "c" , "george" ),
   ( "d",  "george" );

Enter a procedure to compute solution and actually ... print it out.

DELIMITER $$

CREATE OR REPLACE PROCEDURE actors_appearing_in_movies()
BEGIN

   -- collect all the actors
   CREATE OR REPLACE TEMPORARY TABLE tmp_actor (actor CHAR(20) PRIMARY KEY)
     AS SELECT DISTINCT actor from starsin;

   -- table of "all actors x (input movies + '--' placeholder)"
   -- (combinations that are needed for an actor to show up in the result)
   -- and a flag indicating whether that combination shows up for real
   CREATE OR REPLACE TEMPORARY TABLE tmp_needed 
     (actor CHAR(20), 
      movie CHAR(20), 
      actual TINYINT NOT NULL DEFAULT 0,
     PRIMARY KEY (actor, movie))
   AS 
     (SELECT ta.actor, mi.movie FROM tmp_actor ta, movies_in mi)
     UNION
     (SELECT ta.actor, "--" FROM tmp_actor ta);

   -- SELECT * FROM tmp_needed;

   -- Mark those (actor, movie) combinations which actually exist
   -- with a numeric 1
   UPDATE tmp_needed tn SET actual = 1 WHERE EXISTS
      (SELECT * FROM starsin si WHERE
             si.actor = tn.actor AND si.movie = tn.movie);

   -- SELECT * FROM tmp_needed;

   -- The result is the set of actors in "tmp_needed" which have as many
   -- entries flagged "actual" as there are entries in "movies_in"

   SELECT actor FROM tmp_needed GROUP BY actor 
      HAVING SUM(actual) = (SELECT COUNT(*) FROM movies_in);

END$$

DELIMITER ;

Testing

There is no ready-to-use unit testing framework for MariaDB, so we "test by hand" and write a procedure, the out of which we check manually. Variadic arguments don't exist, vector data types don't exist. Let's accept up to 4 movies as input and check the result manually.

DELIMITER $$

CREATE OR REPLACE PROCEDURE 
   test_movies(IN m1 CHAR(20),IN m2 CHAR(20),IN m3 CHAR(20),IN m4 CHAR(20))
BEGIN
   CREATE OR REPLACE TEMPORARY TABLE movies_in (movie CHAR(20) PRIMARY KEY);   
   CREATE OR REPLACE TEMPORARY TABLE args (movie CHAR(20));
   INSERT INTO args VALUES (m1),(m2),(m3),(m4); -- contains duplicates and NULLs
   INSERT INTO movies_in (SELECT DISTINCT movie FROM args WHERE movie IS NOT NULL); -- clean
   DROP TABLE args;   
   CALL actors_appearing_in_movies();        
END$$

DELIMITER ;

The above passes all the manual tests, in particular:

CALL test_movies(NULL,NULL,NULL,NULL);

+--------+
| actor  |
+--------+
| bob    |
| george |
| maria  |
+--------+
3 rows in set (0.003 sec)

For example, for CALL test_movies("a","b",NULL,NULL);

First set up the table with all actors against in all the movies in the input set, including the "doesn't exist" movie represented by a placeholder --.

+--------+--------+-------+
| actual | actor  | movie |
+--------+--------+-------+
|      0 | bob    | --    |
|      0 | bob    | a     |
|      0 | bob    | b     |
|      0 | george | --    |
|      0 | george | a     |
|      0 | george | b     |
|      0 | maria  | --    |
|      0 | maria  | a     |
|      0 | maria  | b     |
+--------+--------+-------+

Then mark those rows with a 1 where the actor-movie combination actually exists in starsin.

+--------+--------+-------+
| actual | actor  | movie |
+--------+--------+-------+
|      0 | bob    | --    |
|      1 | bob    | a     |
|      0 | bob    | b     |
|      0 | george | --    |
|      1 | george | a     |
|      1 | george | b     |
|      0 | maria  | --    |
|      1 | maria  | a     |
|      1 | maria  | b     |
+--------+--------+-------+

Finally select an actor for inclusion in the solution if the SUM(actual) is equal to the number of entries in the input movies table (it cannot be larger), as that means that the actor indeed appears in all movies of the input movies table. In the special case where that table is empty, the actor-movie combination table will only contain

+--------+--------+-------+
| actual | actor  | movie |
+--------+--------+-------+
|      0 | bob    | --    |
|      0 | george | --    |
|      0 | maria  | --    |
+--------+--------+-------+

and thus all actors will be selected, which is what we want.