
I have a list of sentences, where each sentence is a nested list of its words, such as:

[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22']]

I also have another list in which each word corresponds to an entity tag, such as:

[['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
 ['B-PER', 'I-PER'],
 ['B-LOC', 'O']]

This is the basic CoNLL-2003 data, but I'm actually using a different dataset in another language. I only show this one as an example representation.

I want to convert this list of lists into JSONL format, where each line looks like this:

{"text": "EU rejects German call to boycott British lamb.", "labels": [ [0, 2, "ORG"], [11, 17, "MISC"], ... ]}
{"text": "Peter Blackburn", "labels": [ [0, 15, "PERSON"] ]}
{"text": "President Obama", "labels": [ [10, 15, "PERSON"] ]}
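
For illustration, here is a minimal sketch of how records in this shape can be written out as JSONL, one JSON object per line. The helper name `to_jsonl` and the sample records are made up for the example; only the standard-library `json` module is used.

```python
import json

def to_jsonl(records):
    """Serialise an iterable of dicts as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps non-ASCII characters (e.g. Turkish letters) readable
    # instead of escaping them as \uXXXX sequences.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)

# Hypothetical sample records in the target shape.
records = [
    {"text": "Peter Blackburn", "labels": [[0, 15, "PER"]]},
    {"text": "BRUSSELS 1996-08-22", "labels": [[0, 8, "LOC"]]},
]
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

The resulting string can then be written to a file opened with `encoding="utf-8"`.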

So far I have managed to put the list of lists into this format (a JSON list of dicts):

[{'id': 0,
  'text': 'Corina Casanova , İsviçre Federal Şansölyesidir .',
  'labels': [[0, 6, 'B-Person'],
   [7, 15, 'I-Person'],
   [18, 25, 'B-Country'],
   [26, 33, 'B-Misc'],
   [34, 47, 'I-Misc']]},
 {'id': 1,
  'text': "Casanova , İsviçre Federal Yüksek Mahkemesi eski Başkanı , Nay Giusep'in pratiğinde bir avukat olarak çalıştı .",
  'labels': [[0, 8, 'B-Person'],
   [11, 18, 'B-Misc'],
   [19, 26, 'I-Misc'],
   [27, 33, 'I-Misc'],
   [34, 43, 'I-Misc'],
   [59, 62, 'B-Person'],
   [63, 72, 'I-Person']]}]

However, I want to merge the IOB tags so that each entity becomes a single start-to-end span. I need this format to upload the data to the doccano annotation tool, with compound entities labeled as one.

Here is the code I wrote to create the above format:

list_json = []

for x, i in enumerate(sentences[0:2]):
    list_json.append({"id": x})
    list_json[x]["text"] = " ".join(i)
    list_json[x]["labels"] = []
    for y, j in enumerate(labels[x]):
        if j in ['B-Person', 'I-Person', 'B-Country'...(private data)]:
            word = i[y]
            # Note: find() returns the first occurrence, so a repeated word gets the offsets of its first appearance
            wordStartIndex = list_json[x]["text"].find(word)
            wordEndIndex = wordStartIndex + len(word)
            list_json[x]["labels"].append([wordStartIndex, wordEndIndex, j])

I tried converting the above format into the format I want, i.e. merging the IOB tags. Here is what I have tried so far; it didn't work.

new_labels = []

for y, i in enumerate(list_json):
    label_names = [item[2] for item in i["labels"]]
    label_BIO = [item[0] for item in label_names]
    k = 0
    for index in range(len(label_BIO)-1):
        
        if (label_BIO[index] == "B" and label_BIO[index+1] == "I") or (label_BIO[index] == "I" and label_BIO[index+1] == "I"):
            k += 1
    
    for x in range(len(i["labels"])-1):
        
        
        if i["labels"][x][2][0] == "B" and i["labels"][x+1][2][0] == "I":
            new_labels.append([i["labels"][x][0],i["labels"][x+k-1][1],i["labels"][x][2][2:]])
                
        elif i["labels"][x][2][0] != "I" and i["labels"][x+1][2][0] != "I":
            new_labels.append([i["labels"][x][0], i["labels"][x][1], i["labels"][x][2]])

The problem with this block of code is that I can't determine the length of each run of consecutive tags. k is computed once per sentence, so it stays the same for every entity in that sentence. I need k to change for each sequence in the same list.

Here is the error I get:

IndexError                                Traceback (most recent call last)
<ipython-input-93-420750229f93> in <module>
---> 19             new_labels.append([i["labels"][x][0],i["labels"][x+k-1][1],i["labels"][x][2][2:]])
     20 
     21         elif i["labels"][x][2][0] != "I" and i["labels"][x+1][2][0] != "I":

IndexError: list index out of range

I need to work out where exactly to calculate k each time. Here k is the length of a sequence in which a B tag is followed by one or more I tags.

I also tried this, but it only merges two labels at a time:

new_labels = []

for y, i in enumerate(list_json):
    I_labels = []
    for x, j in reversed(list(enumerate(i["labels"]))):
        if j[2][0] == "I" and i["labels"][x-1][2][2:] == j[2][2:]:
            new_labels.append([i["labels"][x-1][0],j[1],j[2][2:]])
        elif j[2][0] != "I" and i["labels"][x+1][2][0] != "I":
            new_labels.append([j[0], j[1], j[2]])

Output:

[[26, 47, 'Misc'],
 [18, 25, 'Country'],
 [0, 15, 'Person'],
 [59, 72, 'Person'],
 [27, 43, 'Misc'],
 [19, 33, 'Misc'],
 [11, 26, 'Misc'],
 [0, 8, 'Person']]

But I need the 3 "Misc" labels to be one single label from index 11 to 43.
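
To make the target concrete, here is a minimal sketch of the merge over the intermediate `[start, end, tag]` lists. It keeps one open entity and extends its end offset while `I-` tags of the same type follow, so no run length `k` is needed. The function name `merge_iob_spans` is hypothetical, and the sketch assumes well-formed IOB tags of the form `B-X` / `I-X` with `O` tokens already dropped.

```python
def merge_iob_spans(spans):
    """Merge [start, end, 'B-X' or 'I-X'] spans into [start, end, 'X'] entities."""
    merged = []
    for start, end, tag in spans:
        prefix, _, name = tag.partition("-")  # 'I-Misc' -> ('I', '-', 'Misc')
        if prefix == "I" and merged and merged[-1][2] == name:
            # Continuation of the current entity: only move its end offset.
            merged[-1][1] = end
        else:
            # A B- tag (or a stray I- tag) opens a new entity.
            merged.append([start, end, name])
    return merged

# Labels of the second example sentence above.
spans = [[0, 8, 'B-Person'], [11, 18, 'B-Misc'], [19, 26, 'I-Misc'],
         [27, 33, 'I-Misc'], [34, 43, 'I-Misc'],
         [59, 62, 'B-Person'], [63, 72, 'I-Person']]
print(merge_iob_spans(spans))
```

For this input the "Misc" entity comes out as a single `[11, 43, 'Misc']` span.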

For anyone wondering: the reason I'm trying to do this is that I have already labeled part of the data and tested a prototype model, which seemed to give pretty good results. So I want to label the whole dataset and fix the false labels instead of annotating from scratch. I think this would save me a lot of time.

PS: I'm aware that doccano supports uploading in the CoNLL format, but that is broken, so I can't upload it that way.

1 Answer


You can convert the sentences to a pandas DataFrame with their respective entity tags and join them. Here is an inspiration.
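
As a rough sketch of the joining step, here is a plain-Python version (not pandas) that goes straight from the token and tag lists to one doccano-style record. It tracks a running character offset instead of searching the text, so repeated words get the right positions. The function name `iob_to_record` is hypothetical, and the sketch assumes tokens are joined with single spaces and tags follow the `B-X` / `I-X` / `O` scheme.

```python
def iob_to_record(tokens, tags):
    """Build {'text': ..., 'labels': [[start, end, type], ...]} from IOB-tagged tokens."""
    labels = []
    cursor = 0        # character offset of the current token in the joined text
    open_name = None  # entity type currently being extended, if any
    for token, tag in zip(tokens, tags):
        start, end = cursor, cursor + len(token)
        cursor = end + 1  # skip the single joining space
        if tag == "O":
            open_name = None
            continue
        prefix, _, name = tag.partition("-")
        if prefix == "I" and open_name == name:
            labels[-1][1] = end  # extend the open entity to this token
        else:
            labels.append([start, end, name])  # B- tag (or stray I-) opens a new entity
            open_name = name
    return {"text": " ".join(tokens), "labels": labels}

sentences = [['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
             ['Peter', 'Blackburn']]
tags = [['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
        ['B-PER', 'I-PER']]
records = [iob_to_record(s, t) for s, t in zip(sentences, tags)]
print(records)
```

For the second sentence this gives `[0, 15, 'PER']` as a single label covering "Peter Blackburn".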

You can also look at this if your data is in the usual CoNLL format.
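
If the data is in a CoNLL-style text file, a minimal sketch for reading it into the nested token and tag lists could look like the following. It assumes whitespace-separated columns with the token first and the entity tag last, and blank lines between sentences; the function name `read_conll` and the sample text are made up for the example.

```python
def read_conll(text):
    """Parse CoNLL-style text into parallel lists of token lists and tag lists."""
    sentences, labels = [], []
    tokens, tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) closes the current sentence.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])   # assumed: token is the first column
        tags.append(columns[-1])    # assumed: entity tag is the last column
    if tokens:  # flush the last sentence when the text lacks a trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels

# Illustrative sample, not taken from a real file.
sample = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
sentences, labels = read_conll(sample)
print(sentences)
print(labels)
```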