Let's say I have a Python list of strings. The strings are tokens of a C++-like language that I have partially tokenized, but I am left with some strings that haven't been fully tokenized. The problem is that the language has a set of symbols, and each symbol must appear as its own token in the list.
Example:
class Test
{
    method int foo(boolean a, int b) { }
}
The output I need is:
tokens = ['class', 'Test', '{', 'method', 'int', 'foo', '(', 'boolean', 'a', ',', 'int', 'b', ')', '{', '}', '}']
The output I get after splitting the code on whitespace:
tokens = ['class', 'Test', '{', 'method', 'int', 'foo(boolean', 'a,', 'int', 'b){', '}', '}']
The code I use takes that partial list, which was split on whitespace:
def tokenize(self, tokens):
    """
    Breaks all tokens into final tokens as needed.
    """
    final_tokens = []
    for token in tokens:
        if not have_symbols(token):
            # No symbols inside, so the token is already final.
            final_tokens.append(token)
        else:
            current_string = ""
            small_tokens = []
            for character in token:
                if character in SYMBOLS_SET:
                    # A symbol ends the text collected so far.
                    if current_string:
                        small_tokens.append(current_string)
                        current_string = ""
                    small_tokens.append(character)
                else:
                    current_string += character
            # Keep any text left after the last symbol, e.g. 'boolean' in 'foo(boolean'.
            if current_string:
                small_tokens.append(current_string)
            final_tokens.extend(small_tokens)
    return final_tokens
where SYMBOLS_SET is a set of symbols:
SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}
and the function have_symbols(token) returns True if token contains a symbol from SYMBOLS_SET and False otherwise.
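For reference, here is a minimal sketch of what have_symbols could look like, assuming it does nothing more than check each character of the token against SYMBOLS_SET. The body is an illustration based on the description above, not necessarily the exact helper.

```python
SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";",
               "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}


def have_symbols(token):
    """Return True if any character of token is in SYMBOLS_SET."""
    # Assumed implementation: a plain per-character membership test.
    return any(character in SYMBOLS_SET for character in token)


print(have_symbols("foo(boolean"))  # True
print(have_symbols("int"))          # False
```

With this definition, tokens such as 'foo(boolean', 'a,' and 'b){' from the example take the splitting branch, while 'class' and 'int' are appended unchanged.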
I think there might be a more elegant way to do this, and I would be glad for some guidance.