I have a list of sentences, where each sentence is a nested list of its words. For example:
[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
['Peter', 'Blackburn'],
['BRUSSELS', '1996-08-22']]
I also have another list in which each word corresponds to an entity tag. For example:
[['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
['B-PER', 'I-PER'],
['B-LOC', 'O']]
This is the standard CoNLL-2003 data, but I'm actually using a different dataset in another language. I only show this one as an example representation.
I want to convert these lists of lists into JSONL format, where each line looks like this:
{"text": "EU rejects German call to boycott British lamb.", "labels": [ [0, 2, "ORG"], [11, 17, "MISC"], ... ]}
{"text": "Peter Blackburn", "labels": [ [0, 15, "PERSON"] ]}
{"text": "President Obama", "labels": [ [10, 15, "PERSON"] ]}
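For reference, here is a minimal sketch of how records in this shape could be serialized as JSONL, one JSON object per line. The helper name `to_jsonl` and the sample record are illustrative only and are not part of doccano. `ensure_ascii=False` is used so that non-ASCII characters (such as Turkish letters) stay readable in the output.

```python
import json

def to_jsonl(records):
    """Serialize an iterable of dicts as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Illustrative record with the same shape as the target format above.
records = [{"text": "Peter Blackburn", "labels": [[0, 15, "PERSON"]]}]
jsonl = to_jsonl(records)
print(jsonl)
```

The resulting string can be written to a file opened with `encoding="utf-8"`.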
So far I have managed to put the lists into this format (a JSON list of dicts):
[{'id': 0,
'text': 'Corina Casanova , İsviçre Federal Şansölyesidir .',
'labels': [[0, 6, 'B-Person'],
[7, 15, 'I-Person'],
[18, 25, 'B-Country'],
[26, 33, 'B-Misc'],
[34, 47, 'I-Misc']]},
{'id': 1,
'text': "Casanova , İsviçre Federal Yüksek Mahkemesi eski Başkanı , Nay Giusep'in pratiğinde bir avukat olarak çalıştı .",
'labels': [[0, 8, 'B-Person'],
[11, 18, 'B-Misc'],
[19, 26, 'I-Misc'],
[27, 33, 'I-Misc'],
[34, 43, 'I-Misc'],
[59, 62, 'B-Person'],
[63, 72, 'I-Person']]}]
However, the problem with this is that I want to merge the IOB tags and create a single start-to-end span per entity. I need this format to be able to upload the data to the doccano annotation tool, so compound entities must be labeled as one.
Here is the code I wrote to create the above format:
list_json = []
for x, i in enumerate(sentences[0:2]):
    list_json.append({"id": x})
    list_json[x]["text"] = " ".join(i)
    list_json[x]["labels"] = []
    for y, j in enumerate(labels[x]):
        if j in ['B-Person', 'I-Person', 'B-Country'...(private data)]:
            word = i[y]
            wordStartIndex = list_json[x]["text"].find(word)
            wordEndIndex = list_json[x]["text"].index(word) + len(word)
            list_json[x]["labels"].append([wordStartIndex, wordEndIndex, j])
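One caveat with the snippet above: `str.find` and `str.index` return the first occurrence, so a word that appears more than once in a sentence would get the offsets of its first appearance. Below is a minimal sketch of computing offsets with a running cursor instead, assuming the text is built with `" ".join(tokens)`. The helper name `token_offsets` and the sample tokens are hypothetical.

```python
def token_offsets(tokens):
    """Return (start, end) character offsets of each token in " ".join(tokens)."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = cursor
        end = start + len(token)
        offsets.append((start, end))
        cursor = end + 1  # skip the single space inserted by " ".join
    return offsets

# Hypothetical sample with a repeated word.
tokens = ["New", "York", "and", "New", "Jersey"]
text = " ".join(tokens)
offsets = token_offsets(tokens)
# The second "New" gets its own offsets, whereas text.find("New") is always 0.
```

Each pair can then be combined with the tag at the same position, since tokens and tags share an index.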
I tried converting the above format into the format I want, i.e. merging the IOB tags. Here is what I have tried so far that didn't work:
new_labels = []
for y, i in enumerate(list_json):
    label_names = [item[2] for item in i["labels"]]
    label_BIO = [item[0] for item in label_names]
    k = 0
    for index in range(len(label_BIO)-1):
        if (label_BIO[index] == "B" and label_BIO[index+1] == "I") or (label_BIO[index] == "I" and label_BIO[index+1] == "I"):
            k += 1
    for x in range(len(i["labels"])-1):
        if i["labels"][x][2][0] == "B" and i["labels"][x+1][2][0] == "I":
            new_labels.append([i["labels"][x][0],i["labels"][x+k-1][1],i["labels"][x][2][2:]])
        elif i["labels"][x][2][0] != "I" and i["labels"][x+1][2][0] != "I":
            new_labels.append([i["labels"][x][0], i["labels"][x][1], i["labels"][x][2]])
The problem with this block of code is that I can't determine the length of each run of consecutive tags. k is computed once per sentence, so it stays the same for every entity in that sentence. I need k to change for each run within the same list.
Here is the error I get:
IndexError Traceback (most recent call last)
<ipython-input-93-420750229f93> in <module>
---> 19 new_labels.append([i["labels"][x][0],i["labels"][x+k-1][1],i["labels"][x][2][2:]])
20
21 elif i["labels"][x][2][0] != "I" and i["labels"][x+1][2][0] != "I":
IndexError: list index out of range
I need to determine where exactly I should calculate k each time. Here k is the length of a run in which a B tag is followed by one or more I tags.
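One possible way to avoid computing k explicitly is to extend the most recently opened entity whenever an I- tag follows it, so each run gets its own length implicitly. This is only a sketch under stated assumptions: the spans are sorted by start offset, tokens are separated by a single space (so a continuation starts at the previous end + 1), and every tag has the form `B-X` or `I-X`. The function name `merge_iob_spans` is hypothetical.

```python
def merge_iob_spans(labels):
    """Merge [start, end, "B-X"/"I-X"] spans into [start, end, "X"] entities.

    Assumes spans are sorted by start offset and tokens are joined by one
    space, so an I- span continues an entity only if it starts at end + 1.
    A stray I- tag with no matching open entity starts a new entity.
    """
    merged = []
    for start, end, tag in labels:
        prefix, _, entity = tag.partition("-")
        continues = (
            prefix == "I"
            and merged
            and merged[-1][2] == entity
            and merged[-1][1] + 1 == start
        )
        if continues:
            merged[-1][1] = end  # extend the currently open entity
        else:
            merged.append([start, end, entity])
    return merged

# Labels of the second sentence (id 1) from the intermediate format above.
labels = [[0, 8, 'B-Person'], [11, 18, 'B-Misc'], [19, 26, 'I-Misc'],
          [27, 33, 'I-Misc'], [34, 43, 'I-Misc'], [59, 62, 'B-Person'],
          [63, 72, 'I-Person']]
merged = merge_iob_spans(labels)
```

Applied per sentence, this would replace `i["labels"]` with the merged list instead of appending everything to one shared `new_labels`.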
I also tried this, but it only merges two labels at a time:
new_labels = []
for y, i in enumerate(list_json):
    I_labels = []
    for x, j in reversed(list(enumerate(i["labels"]))):
        if j[2][0] == "I" and i["labels"][x-1][2][2:] == j[2][2:]:
            new_labels.append([i["labels"][x-1][0],j[1],j[2][2:]])
        elif j[2][0] != "I" and i["labels"][x+1][2][0] != "I":
            new_labels.append([j[0], j[1], j[2]])
Output:
[[26, 47, 'Misc'],
[18, 25, 'Country'],
[0, 15, 'Person'],
[59, 72, 'Person'],
[27, 43, 'Misc'],
[19, 33, 'Misc'],
[11, 26, 'Misc'],
[0, 8, 'Person']]
But I need the 3 "Misc" labels to be one single label from index 11 to 43.
For anyone wondering: the reason I'm trying to do this is that I have already labeled part of the data and tested a prototype model, and it seemed to give pretty good results. So I want to pre-label the whole dataset and fix the false labels instead of annotating from scratch. I think this would save me a lot of time.
P.S. I'm aware that doccano supports uploading in the CoNLL format, but that is broken, so I can't upload it that way.