Let's say I have a Python list of strings. The strings are tokens of a C++-like language that I have partially tokenized, but some strings still have not been split into tokens. The problem is that the language has a set of symbols, and each symbol must appear as its own token in the list.

Example:

class Test 
{
    method int foo(boolean a, int b) { }
}

The output I need is:

tokens = ['class', 'Test', '{', 'method', 'int', 'foo', '(', 'boolean', 'a', ',', 'int', 'b', ')', '{', '}', '}']

The output I get after splitting the code on whitespace:

tokens = ['class', 'Test', '{', 'method', 'int', 'foo(boolean', 'a,', 'int', 'b){', '}', '}']

The code I use works on the partial list, which was split on whitespace:

    def tokenize(self, tokens):
        """
        Breaks all tokens into final tokens as needed.
        """
        final_tokens = []
        for token in tokens:
            if not have_symbols(token):
                final_tokens.append(token)
            else:
                current_string = ""
                small_tokens = []
                for character in token:
                    if character in SYMBOLS_SET:
                        # A symbol ends the text collected so far.
                        if current_string:
                            small_tokens.append(current_string)
                            current_string = ""
                        small_tokens.append(character)
                    else:
                        current_string += character
                # Flush trailing text, e.g. 'boolean' in 'foo(boolean'.
                if current_string:
                    small_tokens.append(current_string)
                final_tokens.extend(small_tokens)
        return final_tokens

where SYMBOLS_SET is a set of symbols:

SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}

and the method have_symbols(token) returns True if the token contains a symbol from SYMBOLS_SET and False otherwise.
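The helper is only described, not shown, so here is a minimal sketch of what it could look like, assuming it simply checks each character against SYMBOLS_SET. The function body is an assumption for illustration, not the original code.

```python
# Same symbols as in the question.
SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-",
               "*", "/", "&", "|", "<", ">", "=", "~"}


def have_symbols(token):
    """Return True if any character of token is in SYMBOLS_SET (assumed behavior)."""
    return any(character in SYMBOLS_SET for character in token)


print(have_symbols("foo(boolean"))  # True
print(have_symbols("class"))        # False
```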

I think there might be a more elegant way to do this; I would be glad for some guidance.

atefsawaed

1 Answer

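import re

# Named source rather than input to avoid shadowing the built-in.
source = r"""
class Test 
{
    method int foo(boolean a, int b) { }
}"""

SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}

# Split on runs of whitespace (dropped) or on any single symbol (kept, because it is captured).
regexp = r"\s+|(" + "|".join(re.escape(s) for s in SYMBOLS_SET) + ")"

splitted = re.split(regexp, source)
# Whitespace matches leave None and adjacent matches leave "", so drop both.
tokens = [x for x in splitted if x not in [None, ""]]

print(tokens)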

gives you:

['class', 'Test', '{', 'method', 'int', 'foo', '(', 'boolean', 'a', ',', 'int', 'b', ')', '{', '}', '}']

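Putting parentheses around the symbols makes them a capturing group, so re.split keeps each matched symbol in the output. Whitespace (\s) is matched outside the group, so it is discarded; the None and empty-string entries left behind are filtered out.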
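To see why the filtering step is needed, this small sketch shows how re.split treats a capturing group. The sample strings and variable names are made up for illustration.

```python
import re

# Without a group, the separators are dropped.
plain = re.split(r",", "a,b")

# With a capturing group, each matched separator is kept.
captured = re.split(r"(,)", "a,b")

# Whitespace is matched outside the group, so it shows up as None
# (the group did not participate), next to an empty string.
mixed = re.split(r"\s+|(,)", "a, b")

print(plain)     # ['a', 'b']
print(captured)  # ['a', ',', 'b']
print(mixed)     # ['a', ',', '', None, 'b']
```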

Horus