3

I have a text pattern matching problem that I could use some direction with. Not being very familiar with pattern recognition overall, I don't know if this is one of those "oh, just use blah-blah algorithm", or this is a really hard pattern problem.

The general statement of what I want to do is identify similarities between a series of SQL statements, in order to allow me to refactor those statements into a smaller number of stored procedures or other dynamically-generated SQL snippets. For example,

SELECT MIN(foo) FROM bar WHERE baz > 123;
SELECT MIN(footer) FROM bar;
SELECT MIN(foo), baz FROM bar;

are all kind of the same, but I would want to recognize that the value inside the MIN() should be a replaceable value, that I may have another column in the SELECT list, or have an optional WHERE clause. Note that this example is highly cooked up, but I hope it allows you to see what I am after.

In terms of scope, I would have a set of thousands of SQL statements that I would hope to reduce to dozens (?) of generic statements. In research so far, I have come across w-shingles, and n-grams, and have discarded approaches like "bag of words" because ordering is important. Taking this out of the SQL realm, another way of stating this problem might be "given a series of text statements, what is the smallest set of text fragments that can be used to reassemble those statements?"

tomo
  • 1,492
  • 13
  • 13

2 Answers2

3

What you really want is to find code clones across the code base.

There's a lot of ways to do that, but most of them seem to ignore the structure that the (SQL) language brings. That structure makes it "easier" to find code elements that make conceptual sense, as opposed to say N-grams (yes, "FROM x WHERE" is common but is an awkward chunk of SQL).

My abstract syntax tree (AST) based clone detection scheme parses source text to ASTs, and then finds shared trees that can be parameterized in a way that produces sensible generalizations by using the language grammar as a guide. See my technical paper Clone Detection Using Abstract Syntax Trees.

With respect to OP's example:

  • It will recognize that the value inside the MIN() should be a replaceable value
  • that the SELECT singleton column could be extended to a list
  • and that the WHERE clause is optional

It won't attempt to make those proposals, unless it find two candidate clones that vary in way these generalizations explain. It gets the generalizations basically by extracting them from the (SQL) grammar. OP's examples have exactly enough variation to force those generalizations.

A survey of clone detection techniques (Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach) rated this approach at the top of some 30 different clone detection methods; see table 14.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • This sounds very much along the lines of what I need to be doing. Your paper and the survey are extremely helpful in pointing the way. It also makes me want to look at code differencing tools, and see what is behind how they work. – tomo Oct 18 '16 at 14:11
  • To complete this thought, I also found this SO topic helpful, http://stackoverflow.com/questions/805626/diff-algorithm, which in turn pointed me at this C# implementation: http://www.mathertel.de/Diff/ – tomo Oct 18 '16 at 16:33
  • IMHO, the most interesting code differencing tools use sort of the same base technology but work as a kind of complement of the clone detector: they *compare* abstract syntax trees, and report what is *different* rather than reporting what is the same. See "Smart Differencer" at my site; no paper published for this one but you should be able to find technical papers on abstract syntax tree differencing at scholar.google.com – Ira Baxter Oct 19 '16 at 03:42
  • Agreed on the value of syntax and semantics, Ira. The Diff approach worked very well for what I needed to do immediately, but it would be far more valuable to run the SQL through a parser and somehow compare the parts more intelligently. Knowing that a piece of SQL had an extra column, or an extra clause in the WHERE would be great. But I would have needed to feed the SQL into Microsoft's parser (assuming I could get at it) and consume the output (assuming I could make sense of it). – tomo Oct 21 '16 at 21:13
  • *but it would be far more valuable to run the SQL through a parser and somehow compare the parts more intelligently* ... See "Smart Differencer" at my site. Sigh. – Ira Baxter Oct 21 '16 at 21:38
1

Question is a bit too broad, but I would suggest to give a shot the following approach:

This sounds like a document clustering problem, where you have a set of pieces of text (SQL statements) and you want to cluster them together to find if some of the statements are close to each other. Now, the trick here is in the distance measure between text statements. I would try something like edit distance

So in general the following approach could work:

  • Do some preprocessing of the sql statements you have. Tokenization, removing some words from statements etc. Just be careful here - you are not analysing just some natural language text, its an SQL statements so you will need some clever approach.
  • After that, try to write a function which would count distance between 2 sql queries. The edit distance should work for you.
  • Finally, try to run document clustering on all your SQL queries, using edit distance as a distance measure for clustering algorithm

Hope that helps.

Maksim Khaitovich
  • 4,742
  • 7
  • 39
  • 70
  • Thank you, that is helpful. But I think the other answer on "code clones" is more on target. – tomo Oct 18 '16 at 14:07