So I fed a dataframe of sentences into BERT for token prediction, and along with the predictions I received the sentences split into word tokens. Now I want to convert my dataframe of split/tokenized sentences and predictions back to the original sentences. (Of course I have the original sentence, but I need to go through this process so that the predictions stay aligned with the sentence tokens.)
Original sentence
You couldn't have done any better because if you could have, you would have.
Post-processing output
['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']
I identified three necessary steps: 1. remove the quote marks, 2. remove the [CLS] and [SEP] tokens along with their surrounding quote marks and commas, 3. remove the commas separating the words and merge the words back into a sentence.
def fix_df(row):
    sentences = row['t_words']
    return remove_edges(sentences)

def remove_edges(sentences):
    x = sentences[9:-9]
    return remove_qmarks(x)

def remove_qmarks(x):
    y = x.replace("'", "")
    return join(y)

def join(y):
    z = ' '.join(y)
    return z

a_df['sents'] = a_df.apply(fix_df, axis=1)
The first two functions largely worked correctly, but the last one did not. Instead, I got a result that looked like this:
Y o u , c o u l d n , " " , t , h a v e, d o n e ,...
The commas didn't go away, and the text got distorted instead. I am definitely missing something. What could that be?
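For reference, here is a minimal, self-contained reproduction of the behavior without the dataframe. It assumes the 't_words' column holds the tokenized sentence as a single string (the printed representation of the token list) rather than an actual Python list, which is what the character slicing in remove_edges() suggests:

```python
# Assumed cell value: the token list stored as one string, not a list.
t_words = ("['[CLS]', 'You', 'couldn', \"'\", 't', 'have', 'done', 'any', "
           "'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', "
           "'would', 'have', '.', '[SEP]']")

x = t_words[9:-9]        # strip the [CLS]/[SEP] edges (character slicing)
y = x.replace("'", "")   # drop the single-quote marks
z = ' '.join(y)          # join -- but y is a str, so this iterates characters

print(z[:40])            # distorted output, e.g. starting with "Y o u ,"
```

Running this outside pandas produces the same "Y o u , c o u l d n ..." distortion as the apply() version.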