6

I am trying to do multi-class classification with textual data. Problem I am facing that I have unstructured textual data. I'll explain the problem with an example. consider this image for example:

example data

I want to extract and classify text information given in image. Problem is when I extract information OCR engine will give output something like this:

18
EURO 46
KEEP AWAY
FROM FIRE
MADE IN CHINA
2226249917581
7412501
DOROTHY
PERKINS

Now target classes here are:

18 -> size
EURO 46 -> price
KEEP AWAY FROM FIRE -> usage_instructions
MADE IN CHINA -> manufacturing_location
2226249917581 -> product_id
7412501 -> style_id
DOROTHY PERKINS -> brand_name

Problem I am facing is that input text is not separable, meaning "multiple lines can belong to same class" and there can be cases where "single line can have multiple classes".

So I don't know how I can split/merge lines before passing it to classification model.
Is there any way using NLP I can split paragraph based on target class. In other words given input paragraph split it based on target labels.

Vivek Mehta
  • 2,612
  • 2
  • 18
  • 30

1 Answers1

5

If you only consider the text, this is a Named Entity Recognition (NER) task.

What you can do is train a Spacy model to NER for your particular problem.

Here is what you will need to do:

  1. First gather a list of training text data
  2. Label that data with corresponding entity types
  3. Split the data into training set and testing set
  4. Train a model with Spacy NER using training set
  5. Score the model using the testing set
  6. ...
  7. Profit!

See Spacy documentation on training specific NER models

Good luck!

amirouche
  • 7,682
  • 6
  • 40
  • 94