I have an oracle sql script containing multiple statements which are separated by ";". I'm trying to remove all comments within this entire script. This includes:
- single line comments
--
till a newline. - block comments /* ... */
- Additionally, I would like to be able to split the entire script into its statements.
I know of the library called sqlparse
that can format to remove code, but I came across this edge case:
edge_case = """myval := oneval || otherval ||--;
-- single comment line
lpad(nvl(myval, ' '), 10, ' ');
"""
formatted = sqlparse.format(edge_case, strip_comments=True)
sqlparse.split(formatted)
# returns ['myval := oneval || otherval ||--;', "lpad(nvl(myval, ' '), 10, ' ');"]
# it splits into 2 statements when it should only be 1!
My idea is to focus on each commenting case, and then try to parse the rest of non-commented code by just string tokens which consists of all valid characters in sql.
from pyparsing import *
# define basic elements
semicolon = Literal(";")
# all valid chars excluding semicolon
all_valid_chars = alphanums + "_-+*/.,:()[]{}<>=&|"
# define an sql token
sql_token = Word(all_valid_chars)
# need to make sure that "--" is a special case
single_line_comment_token = Literal("--")
single_line_comment = single_line_comment_token + SkipTo(line_end) + line_end
test1_single_line_comment = """
-- single comment line
--- also a comment
---- also a comment
-- comment -- continues
-- comment /* also continues */
# - not a comment
"""
single_line_comment.run_tests(test1_single_line_comment)
# also handle multi line comments
multi_line_comment_open = Literal("/*")
multi_line_comment_close = Literal("*/")
multi_line_comment = multi_line_comment_open + SkipTo(multi_line_comment_close) + multi_line_comment_close
test1_multi_line_comment = """
/* multi line comment */
/* outer comment /* inner comment */
/* /* /* still valid */
/* -- still valid */
"""
multi_line_comment.run_tests(test1_multi_line_comment)
test2_multi_line_comment = """
/* multi line comment /* with nested comment should fail! */ */
"""
multi_line_comment.run_tests(test2_multi_line_comment, failure_tests=True)
I'm now stuck here and not sure where to continue.
statement2 = OneOrMore(single_line_comment | multi_line_comment | sql_token) + semicolon
# run tests
print(statement2.parseString("myval := oneval || otherval ||--;"))
My issues:
- The tests for each of these cases work, but I dont know how I should be testing the together to cover overlaps eg.
-- /*
this should be a single line comment etc. - I don't know how to handle the rest of the non commented code properly. For example, all valid characters should include
-
, but this should mess up my single line code? Less technical issues faced: - Honestly not sure where to even begin or how to structure my code since these cases can overlap. I'm not even sure if pyparsing is what I should be doing to solve this problem, but based on questions i've seen trying to remove comments from sql, parsing is the only real robust solution.
run_tests
does not work with multiple lines. This makes it hard to test the block comment although i can useparse_string
.