
So I fed a dataframe of sentences into BERT for token prediction, and along with the predictions I received the sentences split into words. Now I want to revert my dataframe of split/tokenized sentences and predictions back to the original sentences. (Of course I have the original sentences, but I need to do this so that the predictions stay aligned with the sentence tokens.)

Original sentence
You couldn't have done any better because if you could have, you would have.

Post processing
['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']

I identified three necessary steps: 1. remove the quote marks, 2. remove the [CLS] and [SEP] markers along with their extra quote marks and commas, 3. remove the commas separating the words and merge them back together.

def fix_df(row):
    sentences = row['t_words']
    return remove_edges(sentences)

def remove_edges(sentences):
    x = sentences[9:-9]      # meant to cut off the '[CLS]' and '[SEP]' markers
    return remove_qmarks(x)

def remove_qmarks(x):
    y = x.replace("'", "")   # meant to strip the quote marks
    return join(y)

def join(y):
    z = ' '.join(y)          # meant to merge the words back together
    return z


a_df['sents'] = a_df.apply(fix_df, axis=1) 

The first two functions largely worked correctly, but the last one did not. Instead, I got a result that looked like this:

Y o u , c o u l d n , " " , t , h a v e, d o n e ,...

The commas didn't go away, and the text got distorted instead. I am definitely missing something. What could that be?

kay fresh
  • Could you provide a sample of `row['t_words']`? – Anwarvic Apr 27 '20 at 15:49
  • First of all, you have no code that removes commas. Second, your processing function has information loss - the spaces. There's no way to re-establish the sentences without being specific about where to put the spaces. – vitalious Apr 27 '20 at 15:55
  • @Anwarvic I did, above: "Post processing" is a sample of `row['t_words']`, and "original sentence" is the original sentence. – kay fresh Apr 27 '20 at 15:56
  • @vitalious how might I solve these two problems? – kay fresh Apr 27 '20 at 15:57
  • Are you sure that post processing returns a *string*? It really, really looks like it's a basic *list* that got turned into a string somewhere earlier. Which, unfortunately, makes all of your efforts thereafter an [X/Y Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Jongware Apr 27 '20 at 15:59
  • @kayfresh, try changing the `remove_qmarks()` function to use `y = x.replace(",", "")`, and use `z = ''.join(y)` in the `join()` function. – Anwarvic Apr 27 '20 at 16:01
  • @usr2564301 It begins as a dataframe of sentences, just like the sample under 'Original sentence'. Then I feed it into PyTorch BERT, it gets tokenized, and each corresponding token gets tagged. It is then returned as a list of lists of tokens, which I have reshaped back into a dataframe. – kay fresh Apr 27 '20 at 16:02
  • @Anwarvic following your ideas, I would lose the regular commas that are part of the sentence, and I don't want that. – kay fresh Apr 27 '20 at 16:07
  • @usr2564301 If I am to consider this as a kind of X/Y problem, the X problem would lead back to the very way BERT tokenization works. – kay fresh Apr 27 '20 at 16:14
  • But you say BERT (with which I am not familiar) returns a list (of lists). Somewhere your next manipulations change it into its string representation. It's really worth investigating because all problems you are encountering now will magically disappear. You can even test this by copying the current resulting string you have now as literal code, and inspecting what it contains. – Jongware Apr 27 '20 at 16:17
  • @usr2564301 let me provide more context. BERT uses wordpiece tokenization, which breaks down a sentence not into words but into subwords (e.g. mandarin = man# da# rin#, or something like that). However, even though it does the same to the words when preparing them for predictions, it outputs tokens matched to words. At heart, token classification is concerned with words, not sentences. – kay fresh Apr 27 '20 at 16:42
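For reference, a minimal sketch of that wordpiece behaviour, assuming the Hugging Face `transformers` package (the model name is an assumption, chosen because the output above preserves case):

from transformers import BertTokenizer

# Illustrative only: a cased BERT tokenizer splits "couldn't" into
# separate pieces, matching the post-processing output in the question.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print(tokenizer.tokenize("You couldn't have done any better"))
# ['You', 'couldn', "'", 't', 'have', 'done', 'any', 'better']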

1 Answer


The result string really, really looks like a string representation of an otherwise perfectly normal list, so let's have Python convert it back to a list, safely, per Convert string representation of list to list:

import ast

result = """['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']"""

result_as_list = ast.literal_eval(result)
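A quick sanity check that the conversion really produced a list:

print(type(result_as_list))   # <class 'list'>
print(len(result_as_list))    # 20 tokens, including [CLS] and [SEP]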

Now we have this:

['[CLS]', 'You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.', '[SEP]']

Let's go over your steps again. First, "remove the quote marks". But there aren't any (obsolete) quote marks, because this is a list of strings; the extra quotes you see in the representation are only there because that is how a string is represented in Python.
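Incidentally, that string representation also explains the distorted output in the question: calling `' '.join()` on a *string* rather than a list inserts a space between every single character. A quick check, using the `result_as_list` from above:

# The quotes exist only in the printed representation of the list:
print(result_as_list[1])        # You
print(repr(result_as_list[1]))  # 'You'

# Joining a *string* iterates over it character by character:
print(' '.join("You,couldn"))   # Y o u , c o u l d n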

Next, "remove the beginning and end markers". As this is a list, they're just the first and last elements, no further counting needed:

result_as_list = result_as_list[1:-1]

Next, "remove the commas". As in the first step, there are no (obsolete) comma's; they are part of how Python shows a list and are not there in the actual data.

So we end up with

['You', 'couldn', "'", 't', 'have', 'done', 'any', 'better', 'because', 'if', 'you', 'could', 'have', ',', 'you', 'would', 'have', '.']

which can be joined back into the original string using

result_as_string = ' '.join(result_as_list)

and the only problem remaining is that BERT apparently treats apostrophes, commas and full stops as separate 'words':

You couldn ' t have done any better because if you could have , you would have .

which need a bit of replacing:

result_as_string = result_as_string.replace(' ,', ',').replace(' .','.').replace(" ' ", "'")

and you have your sentence back:

You couldn't have done any better because if you could have, you would have.

The only problem I see is if there are leading or closing quotes that aren't part of a contraction. If that matters for your data, you can swap the space-quote-space replacement for a more focused one that specifically targets "couldn't", "can't", "aren't", etc., as sketched below.
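A minimal sketch of such a focused replacement, used in place of the `.replace(" ' ", "'")` step; the list of contraction endings here is illustrative, not exhaustive:

import re

# Only merge an apostrophe that is followed by a common contraction
# ending ('t, 's, 're, 've, 'll, 'd, 'm); other quotes are left alone.
result_as_string = re.sub(r" ' (t|s|re|ve|ll|d|m)\b", r"'\1", result_as_string)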

Jongware