2

I an trying to find a way to determine whether or not an SQL SELECT query A is prone to return a subset of the results returned by another query B. Furthermore, this needs to be acomplished from the queries alone, without having access to the respective result sets.

For example, the query SELECT * from employee WHERE salary >= 1000 will return a subset of the results of query SELECT * from employee. I need to find an automated way to perform this validation for any two queries A and B, without accessing the database that stores the data.

If it is unfeasable to achieve this without the aid of an RDBMS, we can assume that I have access to a local, but empty RDBMS, but with the data stored somewhere else. In addition, this check must be done in code, either using an algorithm or a library. The language I am using is Java, but other language will also do.

Many thanks in advance.

João
  • 21
  • 4

2 Answers2

0

I don't know how deep you want to get into parsing queries, but basically you can say that there are two general ways of making a subset of a query (given that source table and projection(select) staying the same):

  • using where clause to add condition to row values
  • using having clause to add conditions to aggregated values

So you can say that if you have two objects that represent queries and say they look something close to this:

 {
    'select': { ... },
    'from': {},
    'where': {},
    'orderby': {}
 }

and they have select, from and orderby to be the same, but one have extra condition in the where clause , you have a subset.

vittore
  • 17,449
  • 6
  • 44
  • 82
  • Thank you for the answer, but I think there is a corner case to that solution. Imagine the queries "SELECT * from employee WHERE salary = 1000" and "SELECT * from employee WHERE salary >= 1000". They both have the same number of conditions in the WHERE clause, yet the clauses are different, and the former is still a subset of the later. – João Apr 25 '16 at 00:26
  • That's not the corner case really, that is just an example of one of the many side cases you need to go through. More importantly you need to ask yourselves why you are trying to do that. – vittore Apr 25 '16 at 01:58
  • I need to do this to create application cache, but I can't just create it based on mapping string queries to result sets. I also need to be able to detect the case I described in the question. – João Apr 25 '16 at 02:27
  • @João well you meet first hard problem in CS - cache invalidation. How you are generating your queries in the first place? – vittore Apr 25 '16 at 03:09
  • 1
    @João i would try to solve much easier problem - create "nested" cache keys that are easy to relate to each other. ie `Employees`, `Employees.BigSalary` and `Employees.BigSalary.OnVacation`. If you invalidate `Employees` you will do all other keys that starts with this key. and would not spend more time on that. – vittore Apr 25 '16 at 03:14
  • I am not generating the queries myself, but they are hard-coded in a web application. The current case study is an online store, but the goal is to make the cache generic, so that it can be used by any application, without requiring application semantics. In addition, the database is hosted and replicated across remote servers, so I need to avoid contacting the RDBMS. That is why I need to figure out how to detect subqueries. If this task sounds weird and esoteric, it is because it is being done in the context of an academic project. – João Apr 25 '16 at 03:18
  • Apache Tomcat and PostgresSQL database. The web application itself is still to be chosen, but it needs to be open source. – João Apr 25 '16 at 13:03
0

One way you might be able to determine if a query is a subset of another is by examining their source tables. If you don't have access to the data itself, this can be tricky. This question references using Snowflake joins to generate database diagrams based on a query without having access to the data itself:

Generate table relationship diagram from existing schema (SQL Server)

If your query is 800 characters or less, the tool is free to use: https://snowflakejoins.com/index.html

I tested it out using the AdventureWorks database and these two queries:

SELECT * FROM HumanResources.Employee

SELECT * FROM HumanResources.Employee WHERE EmployeeID < 200

When I plugged both of them into the Snowflake Joins text editor, this is what was generated:

SnowflakeJoins DB Diagram example

Hope that helps.

Community
  • 1
  • 1
happygostacie
  • 173
  • 1
  • 11
  • Thank you for the suggestion, but this tool seems to be available only as a web app/service. I need to use it as a library and invoke it within code :( – João Apr 25 '16 at 01:26
  • Gotcha. I don't know how to do that without building a Web service call, which I think would be outside the scope of your question. In the future, you should put those kinds of details in the question itself in order to help better guide the responses. – happygostacie Apr 25 '16 at 02:17
  • You are right. I edited the question to make that explicit. – João Apr 25 '16 at 02:40