0

I have written code to transform Python code into a list to compute BLEU score:

import re

def tokenize_for_bleu_eval(code):
    code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
    code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
    code = re.sub(r'\s+', ' ', code)
    code = code.replace('"', '`')
    code = code.replace('\'', '`')
    tokens = [t for t in code.split(' ') if t]

    return tokens

Thanks to this snippet my code struct.unpack('h', pS[0:2]) is parsed properly into the list ['struct', '.', 'unpack', '(', 'h', ',', 'p', 'S', '[', '0', ':', '2', ']', ')'].

Initially, I thought I need simply to use the ' '.join(list_of_tokens) but it kills my variable names like this struct . unpack ('h' , p S [ 0 : 2 ] ) and my code is not executable.

I tried to use Regex to stick some variable names but I can't succeed to reverse my function tokenize_for_bleu_eval to find executable code at the end. Is someone get an idea, perhaps without regex which seems to be too complicated here?

EDIT: We can't just remove all spaces between element of the list because there are examples like items = [item for item in container if item.attribute == value] where the result of the backtranslation without space would be itemforiteminaifitem[0]==1 which is not valid.

Jonor
  • 1,102
  • 2
  • 15
  • 31
  • Can you give an example of a correct output for your sample list? – Tzane Jul 02 '21 at 07:13
  • I am not sure what you mean. The intended output is the initial code, i.e. `struct.unpack('h', pS[0:2])` – Jonor Jul 02 '21 at 09:20
  • 1
    Regex looks like a very inappropriate tool here. Did you consider a Python code parser approach? I have not tried that, but [you might find some tips here](https://stackoverflow.com/questions/768634/parse-a-py-file-read-the-ast-modify-it-then-write-back-the-modified-source-c). – Wiktor Stribiżew Jul 02 '21 at 09:59

1 Answers1

2

I am trying to merge the tokens using this script

import re

def tokenize_for_bleu_eval(code):
    code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
    code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
    code = re.sub(r'\s+', ' ', code)
    code = code.replace('"', '`')
    code = code.replace('\'', '`')
    tokens = [t for t in code.split(' ') if t]

    return tokens

def merge_tokens(tokens):
    code = ''.join(tokens)
    code = code.replace('`', "'")
    code = code.replace(',', ", ")

    return code

tokenize = tokenize_for_bleu_eval("struct.unpack('h', pS[0:2])")
print(tokenize)  # ['struct', '.', 'unpack', '(', '`', 'h', '`', ',', 'p', 'S', '[', '0', ':', '2', ']', ')']
merge_result = merge_tokens(tokenize)
print(merge_result)  # struct.unpack('h', pS[0:2])

Edit:

I found this interesting idea to tokenize and merge.

import re

def tokenize_for_bleu_eval(code):
    tokens_list = []
    codes = code.split(' ')
    for i in range(len(codes)):
        code = codes[i]
        code = re.sub(r'([^A-Za-z0-9_])', r' \1 ', code)
        code = re.sub(r'([a-z])([A-Z])', r'\1 \2', code)
        code = re.sub(r'\s+', ' ', code)
        code = code.replace('"', '`')
        code = code.replace('\'', '`')
        tokens = [t for t in code.split(' ') if t]
        tokens_list.append(tokens)
        if i != len(codes) -1:
            tokens_list.append([' '])
    
    flatten_list = []

    for tokens in tokens_list:
        for token in tokens:
            flatten_list.append(token)

    return flatten_list

def merge_tokens(flatten_list):
    code = ''.join(flatten_list)
    code = code.replace('`', "'")

    return code

test1 ="struct.unpack('h', pS[0:2])"
test2 = "items = [item for item in container if item.attribute == value]"
tokenize = tokenize_for_bleu_eval(test1)
print(tokenize)  # ['struct', '.', 'unpack', '(', '`', 'h', '`', ',', ' ', 'p', 'S', '[', '0', ':', '2', ']', ')']
merge_result = merge_tokens(tokenize)
print(merge_result)  # struct.unpack('h', pS[0:2])

tokenize = tokenize_for_bleu_eval(test2)
print(tokenize)  # ['items', ' ', '=', ' ', '[', 'item', ' ', 'for', ' ', 'item', ' ', 'in', ' ', 'container', ' ', 'if', ' ', 'item', '.', 'attribute', ' ', '=', '=', ' ', 'value', ']']
merge_result = merge_tokens(tokenize)
print(merge_result)  # items = [item for item in container if item.attribute == value]

This script will also remember each space from the input

Rizquuula
  • 578
  • 5
  • 15