How to extract more than one label text item in a single annotation using Google NLP

Question

I created a dataset for Google NLP entity extraction and uploaded the input data (train, test, and validation JSONL files) in the required NLP format to a Google Cloud Storage bucket.

Sample Annotation:

   {
    "annotations": [{
        "text_extraction": {
            "text_segment": {
                "end_offset": 10,
                "start_offset": 0
            }
        },
        "display_name": "Name"
    }],
    "text_snippet": {
        "content": "JJ's Pizza\n "
    }
}
{
    "annotations": [{
        "text_extraction": {
            "text_segment": {
                "end_offset": 9,
                "start_offset": 0
            }
        },
        "display_name": "City"
    }],
    "text_snippet": {
        "content": "San Francisco\n "
    }
}

Here is the input text for which I want to predict the labels "Name", "City" and "State":

Best J J's Pizza in San Francisco, CA

The result is shown in the following screenshot.

I expected the predicted results to be the following:

Name: JJ's Pizza
City: San Francisco
State: CA

Could you explain how you trained the model? How many samples did you use in the training set? - Kim, Jun 03 '20 at 09:14
@Kim The model is Google AutoML NLP. I have 150 samples (train, test, validation) in the training set. - dwayneJohn, Jun 03 '20 at 09:47
Is it returning only Name? Or do you also get the labels City and State but without any value? Nevertheless, I guess 150 samples is too few to train any model. Try with more data; it might change the result. - IMB, Jun 09 '20 at 09:43
@IMB Yes, it's returning only Name, because I put Name as the first label in the dataset. The City and State labels do have values. - dwayneJohn, Jun 10 '20 at 08:12
I would just like to understand what you meant by the labels "City" and "State" having values. From the picture above we can see that only the "Name" label has a value of 1, as if the other labels never existed. I would really advise you to add more training data to your model, and more content to your testing set. - Abdelilah.F, Jul 25 '20 at 02:40

Jofre · Answer 1 · 2020-08-13T09:06:41.440

According to the sample annotation you provided, you're setting the whole text_snippet to be a name (or whatever field you want to extract).

This can mislead the model into treating the entire text as that entity.

It would be better to use training data like the samples in the documentation: a larger chunk of text in which only the entities we want extracted are annotated.
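To spot this problem in an existing dataset, something like the following could help. It is only a rough sketch: the function name `covers_whole_snippet` and the two sample records are made up for illustration, and it assumes each JSONL line has the same `annotations` / `text_snippet` layout as the sample in the question.

```python
import json

def covers_whole_snippet(example: dict) -> bool:
    """Return True if a single annotation spans all non-whitespace text."""
    content = example["text_snippet"]["content"]
    stripped = content.strip()
    if not stripped:
        return False
    first = len(content) - len(content.lstrip())  # offset of first non-whitespace char
    last = first + len(stripped)                  # offset just past the last one
    for ann in example.get("annotations", []):
        seg = ann["text_extraction"]["text_segment"]
        if seg["start_offset"] <= first and seg["end_offset"] >= last:
            return True
    return False

# Hypothetical records: the first mirrors the question's sample,
# the second annotates an entity inside a longer text.
sample = {
    "annotations": [{
        "text_extraction": {"text_segment": {"start_offset": 0, "end_offset": 10}},
        "display_name": "Name",
    }],
    "text_snippet": {"content": "JJ's Pizza\n "},
}
in_context = {
    "annotations": [{
        "text_extraction": {"text_segment": {"start_offset": 6, "end_offset": 16}},
        "display_name": "Name",
    }],
    "text_snippet": {"content": "Worst JJ's Pizza in San Francisco\n "},
}

# Stand-in for reading a .jsonl file line by line.
lines = [json.dumps(sample), json.dumps(in_context)]
flagged = [n for n, line in enumerate(lines, 1)
           if covers_whole_snippet(json.loads(line))]
print(flagged)  # line numbers whose annotation swallows the whole snippet
```

Any line number it reports is a training example where the model never sees the entity surrounded by other text.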

As an example, let's say that I give the model these text snippets and tell it that each snippet marked [a] is, in its entirety, an entity named a, while each snippet marked [b] is an entity called b:

[a] JJ Pizza
[a] LL Burritos
[a] Kebab MM
[a] Shushi NN
[b] San Francisco
[b] NY
[b] Washington
[b] Los Angeles

Then, when the model reads Best JJ Pizza, it thinks the whole text is a single entity (we trained the model with this assumption), and it will just choose the label that matches best (in this case, it would likely say it's an a entity).

However, suppose I provide the following text samples instead, annotated so that the spans marked [a: ...] are entity a and those marked [b: ...] are entity b:

The best pizza place in [b: San Francisco] is [a: JJ Pizza].
For a luxurious experience, do not forget to visit [a: LL Burritos] when you're around [b: NY].
I once visited [a: Kebab MM], but there are better options in [b: Washington].
You can find [a: Shushi NN] in [b: Los Angeles].

You can see how you're training the model to find the entities within a piece of text, and it will try to extract them according to the context.

The important part about training the model is providing training data as similar to real-life data as possible.

In the example you provided, if the data in your real-life scenario is going to be in the format <ADJECTIVE> <NAME> <CITY>, then your training data should have that same format:

{
    "annotations": [{
        "text_extraction": {
            "text_segment": {
                "end_offset": 16,
                "start_offset": 6
            }
        },
        "display_name": "Name"
    },
    {
        "text_extraction": {
            "text_segment": {
                "end_offset": 33,
                "start_offset": 20
            }
        },
        "display_name": "City"
    }],
    "text_snippet": {
        "content": "Worst JJ's Pizza in San Francisco\n "
    }
}
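Offsets like these are easy to get wrong when counted by hand, so it may be safer to compute them. Below is a minimal sketch, assuming that `start_offset` is the zero-based index of the first character and `end_offset` is the index just past the last one, and that each entity text occurs once in the content. The helper `make_example` is a hypothetical name, not part of any Google library.

```python
import json
from typing import Dict, List, Tuple

def make_example(content: str, entities: List[Tuple[str, str]]) -> Dict:
    """Build one training record from (label, entity_text) pairs."""
    annotations = []
    for label, text in entities:
        start = content.find(text)  # first occurrence only (assumption)
        if start == -1:
            raise ValueError(f"{text!r} not found in content")
        annotations.append({
            "text_extraction": {
                "text_segment": {
                    "start_offset": start,
                    "end_offset": start + len(text),  # exclusive end
                }
            },
            "display_name": label,
        })
    return {"annotations": annotations, "text_snippet": {"content": content}}

example = make_example(
    "Worst JJ's Pizza in San Francisco\n ",
    [("Name", "JJ's Pizza"), ("City", "San Francisco")],
)
print(json.dumps(example))  # one line of the .jsonl file
```

Slicing the content with the computed offsets, e.g. `content[start:end]`, is a quick way to check that each segment really is the entity text.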

Note that the point of a Natural Language ML model is to process natural language. If your inputs are going to look as similar/simple/short as that, then it might not be worth going the ML route. A simple regex should be enough. Without the natural language part, it is going to be hard to properly train a model. More details in the beginners guide.
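For illustration, a regex along these lines might already cover inputs of that shape. This is only a sketch under the assumption that every input looks like `<ADJECTIVE> <NAME> in <CITY>[, <STATE>]` with a two-letter state code; the pattern and the `extract` function are hypothetical and would need adjusting to the real data.

```python
import re

# Assumed shape: "<ADJECTIVE> <NAME> in <CITY>[, <STATE>]"
PATTERN = re.compile(
    r"^(?P<adjective>\w+)\s+(?P<name>.+?)\s+in\s+(?P<city>.+?)"
    r"(?:,\s*(?P<state>[A-Z]{2}))?\s*$"
)

def extract(text: str):
    """Return the Name/City/State fields, or None if the text does not fit."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # State is None when the optional ", XX" suffix is absent.
    return {"Name": m["name"], "City": m["city"], "State": m["state"]}

print(extract("Best JJ's Pizza in San Francisco, CA"))
```

It will break as soon as a name itself contains the word "in" or the wording varies, which is exactly the situation where a trained model starts to pay off.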
