I have written a function that tokenizes Python code into a list so that I can compute a BLEU score:
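import re

def tokenize_for_bleu_eval(code):
    # Surround every character that cannot be part of an identifier with spaces
    code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
    # Split camelCase identifiers at each lower-to-upper boundary
    code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
    # Collapse runs of whitespace into a single space
    code = re.sub(r'\s+', ' ', code)
    # Normalize both quote styles to a backtick
    code = code.replace('"', '`')
    code = code.replace('\'', '`')
    tokens = [t for t in code.split(' ') if t]
    return tokens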
With this snippet, my code struct.unpack('h', pS[0:2])
is tokenized into the list ['struct', '.', 'unpack', '(', '`', 'h', '`', ',', 'p', 'S', '[', '0', ':', '2', ']', ')'].
Initially, I thought I simply needed to use ' '.join(list_of_tokens),
but that breaks my variable names, giving struct . unpack ( ` h ` , p S [ 0 : 2 ] ),
and the code is no longer executable.
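To narrow down what actually breaks, here is a small check using the standard-library ast module. The helper name parses is only for illustration. It suggests that the extra spaces around punctuation are harmless, and that the real problems are the split identifier (p S) and the backticks that replaced the quotes.

```python
import ast

def parses(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Spaces around punctuation do not bother the parser.
print(parses("struct . unpack ( 'h' , pS [ 0 : 2 ] )"))    # True
# An identifier split into two pieces is a syntax error.
print(parses("struct . unpack ( 'h' , p S [ 0 : 2 ] )"))   # False
# Backticks are not valid quote characters in Python 3.
print(parses("struct . unpack ( ` h ` , pS [ 0 : 2 ] )"))  # False
```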
I tried to use regular expressions to glue the variable names back together, but I haven't managed to reverse my function tokenize_for_bleu_eval
and recover executable code at the end. Does anyone have an idea, perhaps without regex, which seems too complicated here?
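EDIT: We can't just remove all spaces between the elements of the list, because there are examples like items = [item for item in container if item.attribute == value]
where the result of the back-translation without spaces would be items=[itemforitemincontainerifitem.attribute==value]
which is no longer the same code: the keywords for, in and if are fused with the names around them into a single identifier.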
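One possible direction without regex, as a rough sketch rather than a true inverse: join the tokens with no separator, and insert a space only between two word-like tokens when at least one of them is a Python keyword. The function name detokenize_sketch and the keyword heuristic are my own assumptions for illustration. The sketch also assumes every backtick was originally a single quote. It stays lossy: whitespace inside string literals, the original quote style, and camelCase names whose pieces happen to be keywords (for example isNone) cannot be recovered this way.

```python
import keyword

WORD_CHARS = set("abcdefghijklmnopqrstuvwxyz"
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_")

def is_word(token):
    """True if the token consists only of identifier characters."""
    return bool(token) and all(ch in WORD_CHARS for ch in token)

def detokenize_sketch(tokens):
    """Heuristically rebuild source text from tokenize_for_bleu_eval output."""
    pieces = []
    prev = None
    for tok in tokens:
        if tok == '`':
            # Assumption: every backtick stood for a single quote.
            tok = "'"
        if prev is not None and is_word(prev) and is_word(tok):
            # Two adjacent words: keep them apart only next to a keyword,
            # otherwise assume they are pieces of one camelCase name.
            if keyword.iskeyword(prev) or keyword.iskeyword(tok):
                pieces.append(' ')
        pieces.append(tok)
        prev = tok
    return ''.join(pieces)

tokens = ['struct', '.', 'unpack', '(', '`', 'h', '`', ',',
          'p', 'S', '[', '0', ':', '2', ']', ')']
print(detokenize_sketch(tokens))  # struct.unpack('h',pS[0:2])
```

On the tokens of the list comprehension from the EDIT, the same heuristic keeps for, in and if separated from the names around them, while item.attribute and == are glued back together.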