Text classification

Question

I have a trivial understanding of NLP so please keep things basic.

I would like to run some PDFs at work through a keyword extractor/classifier and build a taxonomy - in the hope of delivering some business intelligence.

For example, given a few thousand PDFs to mine, I would like to determine the markets they apply to. We serve about 5 major industries, each one having several minor industries. Each industry and sub-industry has a specific market, and in most cases those deal with OEMs, which in turn deal with models, which further subdivide into component parts, etc.

I would love to crunch these PDFs into a semi-structured (more a graph actually) output like:

Aerospace
- Manufacturing
  - Repair
    - PT Support
      - M250
      - C20
      - C18
- Distribution

Can text classifiers do that? Is this too specific? How do you train a system like this to learn that C18 is a "model" of "manufacturer" Rolls Royce in the M250 series and that "PT SUPPORT" is a sub-component?

I could build this data manually, but it would take forever...

Is there a way I could use a text classifier framework and build something more efficiently than with regex and Python?

Just looking for ideas at this point... I watched a few tutorials on R and Python libs, but they didn't sound quite like what I am looking for.

What you want is entity linking I think, see https://en.wikipedia.org/wiki/Entity_linking — amirouche, Apr 04 '18 at 18:36

score 0 · Answer 1 · answered Mar 13 '16 at 20:36


OK, let's break your problem into smaller sub-problems first. I would split the task as follows:

- Read the PDFs and extract their text and metadata. Take a look at the Apache Tika library.
- Create training data for the text classifier. Any classifier needs labeled training data to be effective.
- Then apply any suitable classification algorithm.
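As a rough illustration of the last two steps above, here is a minimal sketch of a text classifier in plain Python. It assumes the text has already been extracted from the PDFs. The labels, the training snippets and the `NaiveBayes` class are all made up for the example; a hand-rolled Naive Bayes is just one possible choice of "suitable algorithm", not the one this answer prescribes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, labeled_texts):
        for text, label in labeled_texts:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled snippets; real text would come from the PDFs.
training = [
    ("M250 engine overhaul and repair manual", "Aerospace/Repair"),
    ("C20 turbine repair procedures", "Aerospace/Repair"),
    ("parts catalogue and shipping for distributors", "Aerospace/Distribution"),
    ("distributor price list and shipping terms", "Aerospace/Distribution"),
]
model = NaiveBayes().fit(training)
predicted = model.predict("repair notes for the C18 engine")
print(predicted)
```

Note that the classifier only assigns one of the labels it was trained on. It does not discover the hierarchy by itself, which is why the training data has to be created first.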

You can also have a look at Carrot2, a clustering library; it will automatically analyse the data and group the PDFs into different categories.
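To give a feel for what clustering does when there are no labels, here is a small sketch. It is not Carrot2 and does not use its algorithms; it is a simple greedy grouping by cosine similarity over word counts, and the sample documents, the `cluster` function and the `threshold` value are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector (token -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Single pass: join the most similar cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(texts):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            # The centroid is kept as the summed counts of its members.
            best["centroid"].update(vec)
            best["members"].append(i)
    return [c["members"] for c in clusters]


# Made-up one-line "documents" standing in for extracted PDF text.
docs = [
    "M250 engine repair manual",
    "distributor shipping price list",
    "C20 engine repair procedures",
    "shipping terms for each distributor",
]
groups = cluster(docs)
print(groups)
```

The groups come out unnamed, so someone still has to look at each cluster and decide what it represents.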


GaneshP


The PDFs are the training data - at least that was the hope. I could manually construct the relationships between all manufacturers, models, engines, components, etc., but it's a huge task and defeats the purpose. I was hoping a classifier could determine/estimate when entities are related based on their occurrences and uses in the PDFs. The manufacturer usually precedes the model in all the documents I have looked at. – Alex.Barylski Mar 13 '16 at 20:41
Then the problem becomes more of a clustering problem than a classification one; in that case you can give Carrot2 a try: http://stackoverflow.com/a/5064981/847897 – GaneshP Mar 13 '16 at 20:45
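The observation in the comments that the manufacturer usually precedes the model can be exploited fairly cheaply. The following is only a sketch under assumptions: it supposes a short seed list of manufacturer names is known up front, and the `candidate_models` helper and the sample sentences are invented for the example. It counts the token that directly follows each manufacturer name, so frequent followers can be reviewed as candidate models.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed list; only the manufacturers need to be known up front.
MANUFACTURERS = ["Rolls Royce"]


def candidate_models(texts, manufacturers=MANUFACTURERS):
    """Count the token that directly follows each manufacturer name."""
    found = defaultdict(Counter)
    for name in manufacturers:
        # Allow whitespace or a hyphen between the words of the name.
        name_re = r"[\s-]+".join(re.escape(part) for part in name.split())
        # A candidate must start with an uppercase letter or a digit.
        pattern = re.compile(name_re + r"\s+([A-Z0-9][A-Za-z0-9-]*)")
        for text in texts:
            found[name].update(pattern.findall(text))
    return found


# Made-up sentences standing in for text extracted from the PDFs.
texts = [
    "Repair of the Rolls Royce M250 turboshaft.",
    "Rolls-Royce M250 series, including Rolls Royce C20 and Rolls Royce C18.",
    "Rolls Royce engines are supported.",
]
counts = candidate_models(texts)
print(counts["Rolls Royce"].most_common())
```

This only proposes manufacturer/model pairs; the candidates would still need a manual check before they go into the taxonomy.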
