
I'm new to NLP. I am looking for recommendations for an Annotation tool to create a labeled NER dataset from raw texts.

In detail:

I'm trying to create a labeled dataset for specific types of entities in order to develop my own NER project (rule-based at first). I assumed there would be some friendly frameworks that let you create tagging projects, tag text data, build a labeled dataset, and even share projects so that several people can work on the same one, but I'm struggling to find one (I admit "friendly" and "intuitive" are subjective, yet this is my experience).

So far I've tried several tools:

  • I tried LightTag. It makes the tagging itself fast and easy (i.e. marking the words and giving them labels), but the overall process of creating a useful dataset is not as intuitive as I expected (i.e. uploading the text files, splitting them into separate tagging objects, saving the tags, etc.).
  • I've installed and tried LabelStudio and found it less mature than LightTag (I don't mean to judge here :)).
  • I've also read about Prodigy, the paid annotation tool from the makers of spaCy. I would consider purchasing it, but their website only offers a live demo of the tagging phase, so I can't assess whether the product is superior to the other two above.

Even on Stack Overflow, the latest question I found on this topic is over 5 years old.

Do you have any recommendations for a tool to create a labeled NER dataset from raw text?

3 Answers


⚠️ Disclaimer

I am the author of Acharya. I will limit my answer to the points raised in the question.


Based on your question, Acharya would let you create a project, upload your raw text data, and annotate it to produce a labeled dataset.

It lets you mark individual records as train or test within the dataset, and it provides data-centric reports to help identify and fix annotation/labeling errors.

It allows you to add different algorithms (bring your own algorithm) to the project and train the model regularly. Once trained, it can give annotation suggestions from the trained models on untagged data to make the labeling process faster.

If you want to train in a different setup, it allows you to export the labeled dataset in multiple supported formats.

Currently, it does not support sharing of projects.

The Acharya community edition is in alpha release: GitHub page (https://github.com/astutic/Acharya), website (https://acharya.astutic.com/).

Doccano is another open-source annotation tool that you can check out: https://github.com/doccano/doccano


I have used both DOCCANO (https://github.com/doccano/doccano) and BRAT (https://brat.nlplab.org/).

I find the latter very good, and it supports more functions. Both are free to use.

Robert Alexander

You may try the recently developed Automatic Text Annotation Tool for spaCy NER, available at https://termitexpert.in/annotation_spacy_ner . This tool converts your raw data into annotated data if you supply the entities and their corresponding items. The annotated data is produced in a JSON format that spaCy version 2 supports for developing a custom named entity recognition (NER) model.

For example, if you have the entity FRUIT and its corresponding items are (apple, mango, banana), the tool automatically finds each item in your text and annotates it as FRUIT. You can add other entities and their corresponding items as well.
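To make the idea concrete, here is a minimal sketch of dictionary-based annotation in plain Python. It is not the tool's actual implementation, which I have not seen; the function name `annotate` and the whole-word, case-insensitive matching rule are my own assumptions. The output follows the `(text, {"entities": [(start, end, label)]})` layout commonly used for spaCy v2 training data.

```python
import re


def annotate(text, entity_items):
    """Label every dictionary item found in `text`.

    `entity_items` maps a label (e.g. "FRUIT") to a list of surface forms.
    Hypothetical helper for illustration only; matching is whole-word and
    case-insensitive, which is an assumption, not the tool's documented rule.
    """
    spans = []
    for label, items in entity_items.items():
        for item in items:
            pattern = r"\b" + re.escape(item) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), label))

    # Sort by start offset, preferring the longer span on ties, then drop
    # any span that overlaps one already kept (entity spans must not overlap).
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    entities = []
    last_end = -1
    for start, end, label in spans:
        if start >= last_end:
            entities.append((start, end, label))
            last_end = end

    return (text, {"entities": entities})


example = annotate(
    "I ate an apple and a mango.",
    {"FRUIT": ["apple", "mango", "banana"]},
)
```

Each `(start, end, label)` triple holds character offsets, so `text[start:end]` gives back the matched item. Keep in mind that pure dictionary matching cannot disambiguate by context (for instance "Apple" the company), so the result is best treated as a first pass to be reviewed by hand.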

Note: The above method works fine with spaCy v2. To use spaCy v3.0, you may have to convert the JSON data to the DocBin format and use that for training; see the spaCy documentation.

Murari Kumar