3

I have an oracle sql script containing multiple statements which are separated by ";". I'm trying to remove all comments within this entire script. This includes:

  1. single line comments -- till a newline.
  2. block comments /* ... */
  3. Additionally, I would like to be able to split the entire script into its statements.

I know of the library called sqlparse that can format to remove code, but I came across this edge case:

edge_case = """myval := oneval || otherval ||--;
-- single comment line
lpad(nvl(myval, ' '), 10, ' ');
"""

formatted = sqlparse.format(edge_case, strip_comments=True)
sqlparse.split(formatted)

# returns ['myval := oneval || otherval ||--;', "lpad(nvl(myval, ' '), 10, ' ');"]
# it splits into 2 statements when it should only be 1!

My idea is to focus on each commenting case, and then try to parse the rest of non-commented code by just string tokens which consists of all valid characters in sql.

from pyparsing import *

# define basic elements
semicolon = Literal(";")
# all valid chars excluding semicolon
all_valid_chars = alphanums + "_-+*/.,:()[]{}<>=&|"

# define an sql token
sql_token = Word(all_valid_chars)


# need to make sure that "--" is a special case
single_line_comment_token = Literal("--")
single_line_comment = single_line_comment_token + SkipTo(line_end) + line_end
test1_single_line_comment = """
-- single comment line
--- also a comment
---- also a comment
-- comment -- continues
-- comment /* also continues */

# - not a comment
"""
single_line_comment.run_tests(test1_single_line_comment)


# also handle multi line comments
multi_line_comment_open = Literal("/*")
multi_line_comment_close = Literal("*/")
multi_line_comment = multi_line_comment_open + SkipTo(multi_line_comment_close) + multi_line_comment_close

test1_multi_line_comment = """
/* multi line comment */
/* outer comment /* inner comment */

/* /* /* still valid */
/* -- still valid */
"""
multi_line_comment.run_tests(test1_multi_line_comment)

test2_multi_line_comment = """
/* multi line comment /* with nested comment should fail! */ */
"""
multi_line_comment.run_tests(test2_multi_line_comment, failure_tests=True)

I'm now stuck here and not sure where to continue.

statement2 = OneOrMore(single_line_comment | multi_line_comment | sql_token) + semicolon

# run tests
print(statement2.parseString("myval := oneval || otherval ||--;"))

My issues:

  1. The tests for each of these cases work, but I dont know how I should be testing the together to cover overlaps eg. -- /* this should be a single line comment etc.
  2. I don't know how to handle the rest of the non commented code properly. For example, all valid characters should include -, but this should mess up my single line code? Less technical issues faced:
  3. Honestly not sure where to even begin or how to structure my code since these cases can overlap. I'm not even sure if pyparsing is what I should be doing to solve this problem, but based on questions i've seen trying to remove comments from sql, parsing is the only real robust solution.
  4. run_tests does not work with multiple lines. This makes it hard to test the block comment although i can use parse_string.
Chowder
  • 96
  • 1
  • 7
  • What is the original task you want to solve? The only valid code is the code accepted by particular DBMS that manages particular database instance. Comments may be real comments, may be a part of string literal, and string literal may be q-quoted string that may include single quotation mark inside. There are too much cases to handle. – astentx Jun 01 '23 at 17:29
  • 2
    Comments may also be an important part of the code if they contain hints to the SQL engine. – MT0 Jun 01 '23 at 20:50
  • @astentx My priority is really to remove all comments from a given sql script to do some basic code analysis. the script is assumed to be working, hence I do not really care to check the grammar or build a full compiler (which I wouldn't even know how to start). As for strings, I only need to focus on single quoted strings `'` based on Oracle sql. Agree with the too many cases, I'm just trying my best to break it down. – Chowder Jun 02 '23 at 06:27
  • @MT0 didn't really know that, but let's assume it's not required. Otherwise, it's another problem I will have to juggle with. – Chowder Jun 02 '23 at 06:29

1 Answers1

1

For stripping comments (or any pyparsing expression), you should use transform_string, not parse_string. transform_string scans through the input string and applies parse actions and suppressions. Here is some code to strip out python comments from a script:

import pyparsing as pp

comment = "#" + pp.rest_of_line
# could also use the provided `python_style_comment`
# comment = pp.python_style_comment

python_source = """
def hello(s):
    # say a nice hello!
    print(f"Hi, {s}!")
    # add a hashtag for social media
    print("#hello")
"""

# suppress comments and transform the string
print(comment.suppress().transform_string(python_source))

giving:

def hello(s):
    
    print(f"Hi, {s}!")
    
    print("

Oops, this does not detect the fact that #hello is inside a quoted string.

To fix this, we also insert a parser for quoted strings, but these we don't suppress:

# parse quoted strings first, in case they contain a comment
transform_expr = pp.quoted_string | comment.suppress()
print(transform_expr.transform_string(python_source))

Now giving:

def hello(s):
    
    print(f"Hi, {s}!")
    
    print("#hello")

For your SQL comment handling, you'll do much the same:

sql_single_line_comment = '--' + pp.rest_of_line

# use pyparsing's definition for C-style /* ... */ multiline comments
sql_multi_line_comment = pp.c_style_comment

comment_remover = (
    # parse quoted strings separately so that they don't get suppressed
    pp.quoted_string
    | (sql_single_line_comment | sql_multi_line_comment).suppress()
)

sql = "SELECT * FROM STUDENTS; -- watch out for Bobby Tables!"
print(comment_remover.transform_string(sql))

prints:

SELECT * FROM STUDENTS;

I'm glad to see you are using run_tests! If you want to define tests that span multiple lines, than instead of using the multiline string form, pass the test as a list of strings:

expr.run_tests([test_str1, test_str2, test_str3])

Unfortunately, I don't have a version of run_tests that calls transform_string.

Splitting the lines up into separate statements is best done in a second pass, after first stripping out the comments. You might try something like:

semi = pp.Literal(";")
semi.add_parse_action(pp.replace_with(";\n\n")

And then use the same transformation pattern as was used to strip comments (but this time, using the parse action instead of suppressing). Or use scan_string to locate the ";" terminators, and then write out the SQL that lies between semicolons out to their separate files (left as an exercise for the reader).

PaulMcG
  • 62,419
  • 16
  • 94
  • 130
  • Thanks @PaulMcG. Very new to your library and only know how to use `parse_string` thanks to your docs! Why do you use `rest_of_line` vs `skipto(line_end)`? I went with skipto as I could get additional data (which is the commented content) for basically the same work? – Chowder Jun 02 '23 at 07:06
  • Internally, `rest_of_line` is the same as `SkipTo(line_end)` . You could also use the relatively new `...` (Ellipsis) form: `python_comment = pp.Literal("#") + ... + pp.LineEnd()`. Pyparsing will look at the term following `...` and replace the Ellipsis with `SkipTo(next_term)`. I like it because you don't have to repeat the next term (I'm a big fan of DRY). – PaulMcG Jun 02 '23 at 16:45