I have a trivial understanding of NLP so please keep things basic.

I would like to run some PDFs at work through a keyword extractor/classifier and build a taxonomy - in the hope of delivering some business intelligence.

For example, given a few thousand PDFs to mine, I would like to determine the markets they apply to. We serve about 5 major industries, each with several minor industries. Each industry and sub-industry has a specific market, and in most cases those deal with OEMs, which in turn deal with models, which further subdivide into component parts, etc.

I would love to crunch these PDFs into a semi-structured (more a graph actually) output like:

  • Aerospace
    • Manufacturing
      • Repair
        • PT Support
          • M250
          • C20
          • C18
    • Distribution

Can text classifiers do that? Is this too specific? How do you teach a system like this that C18 is a "model" from "manufacturer" Rolls Royce in the M250 series, and that "PT SUPPORT" is a sub-component?

I could build this data manually but it would take forever...

Is there a way I could use a text classifier framework and build something more efficiently than regex and python?

Just looking for ideas at this point... Watched a few tutorials on R and python libs but they didn't sound quite like what I am looking for.

spac
Alex.Barylski

1 Answer

OK, let's break your problem into smaller sub-problems first. I would split the task as follows:

  1. Read the PDFs and extract the text and metadata from them - take a look at the Apache Tika library.
  2. Any classifier needs training data to be effective, so create labelled training data for the text classifier.
  3. Then apply any suitable classification algorithm.
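
Steps 2 and 3 can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming the text has already been extracted from the PDFs: it uses a small multinomial Naive Bayes classifier written from scratch, and the training sentences and the "trucking" label are made-up placeholders, not real data. In practice you would more likely use an existing library implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-labelled training data (step 2).
train_docs = [
    "M250 turbine engine overhaul and repair manual",
    "C20 engine compressor repair procedures",
    "Fleet truck diesel maintenance schedule",
    "Diesel truck transmission parts catalogue",
]
train_labels = ["aerospace", "aerospace", "trucking", "trucking"]

# Train and classify an unseen document (step 3).
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("C18 turbine engine repair")
print(prediction)
```

Note that a flat classifier like this only assigns one label per document; it does not by itself learn the nested industry/model/component relationships.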

You can also have a look at Carrot2, a clustering engine: it will automatically analyse the data and group the PDFs into different categories.
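
To give a feel for what clustering does without any labels, here is a rough sketch. It is not Carrot2's algorithm, just a simple single-pass grouping by cosine similarity of word counts; the `threshold` value and the sample documents are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Assign each document to the most similar existing cluster,
    or start a new cluster if nothing reaches the threshold."""
    clusters = []  # each entry: {"centroid": Counter, "members": [doc indices]}
    for i, text in enumerate(docs):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # accumulate counts as a crude centroid
            best["members"].append(i)
    return [c["members"] for c in clusters]


# Made-up snippets standing in for extracted PDF text.
docs = [
    "Rolls Royce M250 engine repair",
    "Rolls Royce M250 C20 engine overhaul",
    "diesel truck transmission parts",
    "diesel truck brake parts",
]
groups = cluster(docs)
print(groups)
```

The groups come back as lists of document indices; naming each group (for example "Aerospace") would still be a manual or separate labelling step.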

GaneshP
  • The PDFs are the training data - at least that was the hope. I could manually construct the relationships between all manufacturers, models, engines, components etc., but it's a huge task and defeats the purpose. I was hoping a classifier could determine/estimate when entities are related from their occurrences and uses in the PDFs. The manufacturer usually precedes the model in all the documents I have looked at. – Alex.Barylski Mar 13 '16 at 20:41
  • Then the problem becomes more one of clustering than classification, so you can give Carrot2 a try: http://stackoverflow.com/a/5064981/847897 – GaneshP Mar 13 '16 at 20:45
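
The observation in the comments that the manufacturer usually precedes the model suggests a cheap heuristic for proposing manufacturer/model pairs. The sketch below is only an illustration under assumptions: it needs a small seed list of manufacturer names (the `MANUFACTURERS` list is hypothetical), it treats any following token containing a digit as a candidate model, and the sample sentences are invented.

```python
import re
from collections import Counter, defaultdict

# Hypothetical seed list; a short hand-made list is assumed to be available.
MANUFACTURERS = ["rolls royce"]


def candidate_models(texts, manufacturers=MANUFACTURERS):
    """Count tokens that directly follow a known manufacturer name
    and look like a model designation (contain at least one digit)."""
    found = defaultdict(Counter)
    for text in texts:
        for name in manufacturers:
            pattern = re.compile(
                re.escape(name) + r"\s+([A-Za-z]*\d[\w-]*)", re.IGNORECASE
            )
            for match in pattern.finditer(text):
                found[name][match.group(1).upper()] += 1
    return found


# Invented sentences standing in for extracted PDF text.
texts = [
    "The Rolls Royce M250 series includes several variants.",
    "Overhaul of Rolls Royce C18 compressor.",
    "Rolls Royce M250 PT support kit.",
]
found = candidate_models(texts)
print(dict(found["rolls royce"]))
```

The counts give candidate edges for the graph (manufacturer to model) that can be reviewed by hand, which is usually much faster than building every relationship manually.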