
I want to train my model with doccano or another open-source text annotation tool and continuously improve my model.

For that, my understanding is that I can import annotated data into doccano in the format described here: doccano import

So for a first step I have loaded a model and created a doc:

text = "Test text that should be annotated for Michael Schumacher" 
nlp = spacy.load('en_core_news_sm')
doc = nlp(text)

I know I can export the JSONL format (with text and annotated labels) from doccano and train a model with it, but I want to know how to export that data from a spaCy doc in Python so that I can import it into doccano.

Thanks in advance.

Glorfindel
reencode
  • When you say "doc", do you mean the one you created during script runtime or the model where the spaCy data is contained? – Tiago Duque Sep 13 '19 at 12:31
  • I mean the doc, as seen in the code, being the return value of nlp(text) – reencode Sep 13 '19 at 13:30
  • The "doc" is a Spacy object filled with the data gathered from processing your text input. Are you sure that is this what you need? – Tiago Duque Sep 13 '19 at 13:35
  • To say it in different words. I want to export what the model already knows about my text or better a bunch of texts.Then in doccano import these like shown in the picture, then correct these annotations and maybe add new ones. Then export from doccano and train my spaCy model with that data. Do you see a better way? – reencode Sep 13 '19 at 13:52

4 Answers


I had a similar task recently, here is how I did it:

import spacy
nlp = spacy.load('en_core_web_sm')

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format JSON
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            labels.append([e.start_char, e.end_char, e.label_])
        djson.append({'text': sent.text, "labels": labels})
    return djson

Based on your example ...

text = "Test text that should be annotated for Michael Schumacher."
djson = text_to_doccano(text)
print(djson)

... would print out:

[{'text': 'Test text that should be annotated for Michael Schumacher.', 'labels': [[39, 57, 'PERSON']]}]

On a related note, when you save the results to a file, the standard json.dump approach won't work, as it would write everything as a single JSON list with comma-separated entries. AFAIK, doccano expects one entry per line (JSONL) and without trailing commas. The following snippet resolves this and works like a charm:

import json

with open(filepath, 'w') as f:
    f.write("\n".join(json.dumps(e) for e in djson))
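To sanity-check the result, you can read the file back one line at a time (a minimal sketch, continuing the snippet above and reusing the same filepath variable):

with open(filepath) as f:
    records = [json.loads(line) for line in f]  # one doccano entry per line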

/Cheers

fgaim
  • Useful, got me on the right track, but something's changed and I had to revise this perhaps for current versions of the software. See my answer below. – S'pht'Kr Aug 02 '21 at 01:47

spaCy doesn't support this exact format out of the box, but you should be able to write a custom function fairly easily. Take a look at spacy.gold.docs_to_json(), which shows a similar conversion to JSON.
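For reference, here is a minimal sketch of what that call produces (assuming spaCy 2.x, where spacy.gold is still available; the output is spaCy's own training JSON, which you would then remap to doccano's fields):

import spacy
from spacy.gold import docs_to_json

nlp = spacy.load('en_core_web_sm')
doc = nlp("Test text that should be annotated for Michael Schumacher.")
print(docs_to_json([doc]))  # spaCy's training JSON, not doccano's format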

aab

Doccano and/or spaCy seem to have changed things, and there are now some flaws in the accepted answer. This revised version should be more correct with spaCy 3.1 and Doccano as of 8/1/2021...

import spacy

nlp = spacy.load('en_core_web_sm')

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format JSON
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
        djson.append({'text': sent.text, "label": labels})
    return djson

The differences:

  1. labels becomes the singular label in the JSON (?!?)
  2. e.start_char and e.end_char are actually (now?) the start and end within the document, not within the sentence, so you have to offset them by the position of the sentence within the document; the sketch below shows the effect.
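To see the offset fix in action, run the revised function on a two-sentence text (a minimal sketch; the exact entities depend on the model, but any entity in the second sentence now gets offsets relative to that sentence rather than the whole document):

text = "First comes an opening sentence. Michael Schumacher is mentioned afterwards."
for entry in text_to_doccano(text):
    print(entry)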
S'pht'Kr

I used the Doccano annotation tool to generate annotations, exported the .jsonl file from Doccano, and converted it to the .spacy training format using the following customized code.

Steps to follow:

Step 1: Use the Doccano tool to annotate the data.

Step 2: Export the annotation file from Doccano, which is in .jsonl format.

Step 3: Pass that .jsonl file to the fillterDoccanoData("./root.jsonl") function in the code below. In my case the file is root.jsonl; you can use your own.

Step 4: Use the following code to convert your .jsonl file to a .spacy training file.

Step 5: Finally, you will find train.spacy in your working directory.

Thanks

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import logging
import json

# filter Doccano data and convert it to spaCy training format
def fillterDoccanoData(doccano_JSONL_FilePath):
    try:
        training_data = []
        with open(doccano_JSONL_FilePath, 'r') as f:
            lines = f.readlines()

        for line in lines:
            data = json.loads(line)
            text = data['data']
            entities = data['label']
            if len(entities)>0:
                training_data.append((text, {"entities" : entities}))
        return training_data
    except Exception as e:
        logging.exception("Unable to process " + doccano_JSONL_FilePath + "\n" + "error = " + str(e))
        return None

# read the Doccano annotation file (.jsonl)
TRAIN_DATA = fillterDoccanoData("./root.jsonl") # root.jsonl is the annotation file name

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object
for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    try:
        doc.ents = ents # label the text with the ents
        db.add(doc)
    except Exception:
        print(text, annot) # e.g. overlapping or otherwise invalid entity spans
db.to_disk("./train.spacy") # save the docbin object