
I want to teach an AI to extract specific phrases from PDFs. For example, the product name appears somewhere in the document, and the AI has to find and extract it. My question is whether it is better to feed the PDFs in as images or as extracted text strings, given that the documents are only roughly structured. I hope my question is understandable.

Maybe someone also has some ideas or keywords to get me started :)

EDIT: Thanks to the hint from lsimmons, I found a starting point: https://appliedmachinelearning.blog/2019/04/01/training-deep-learning-based-named-entity-recognition-from-scratch-disease-extraction-hackathon/

I will try this code, just with product names instead of diseases, of course. For anyone with the same problem: this task is called "Named Entity Recognition". I hope this works.
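For anyone wondering what the training data for such a model can look like: NER models are commonly trained on token-level labels, for example in the BIO scheme (B = beginning of an entity, I = inside, O = outside). The snippet below is only a minimal sketch of producing such labels for a known product name. The function name `bio_tags`, the label `PRODUCT`, and the example sentence are made up for illustration and are not taken from the linked article.

```python
def bio_tags(tokens, entity_tokens, label="PRODUCT"):
    """Tag every occurrence of entity_tokens inside tokens using the BIO scheme."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == entity_tokens:
            # First token of the match starts the entity, the rest continue it.
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"
            i += n
        else:
            i += 1
    return tags


# Hypothetical example sentence and product name (whitespace tokenization).
sentence = "The Acme TurboVac 3000 is a cordless vacuum cleaner".split()
product = "Acme TurboVac 3000".split()

for token, tag in zip(sentence, bio_tags(sentence, product)):
    print(f"{token}\t{tag}")
```

Whitespace tokenization is a simplification here; a real pipeline would probably use the tokenizer of whatever NLP library does the training.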

Helyon

1 Answer


Turning the characters in an image of the PDF into text would be more of a computer vision task (OCR), and that does not seem to be what you are after: you are interested in phrase extraction, which is an NLP task. The first step is therefore probably to extract the text from the PDFs and then feed that text into NLP libraries for phrase extraction.

There are a good number of Python libraries for PDF text extraction; a quick Google search turns up several. As for the NLP side, there are many libraries and concepts to learn in this field; again, a quick Google search turns up introductory articles on NLP in Python.
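To make the two-step idea concrete, here is a rough sketch of the second step only, using nothing but the standard library. It assumes the PDF text has already been extracted into a plain string by whichever library you pick (`sample_text` is an invented stand-in for that output), and it assumes the documents label the field as something like "Product name: ...". The pattern and the function name `find_product_names` are illustrative, not part of any library. A rule-based baseline like this can be worth trying before training a model, since the documents are roughly structured.

```python
import re

# Matches lines such as "Product name: Foo" or "Product - Foo" (case-insensitive).
# The label wording is an assumption about how the documents are laid out.
LABEL_PATTERN = re.compile(
    r"^[ \t]*product(?:[ \t]+name)?[ \t]*[:\-][ \t]*(?P<name>.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_product_names(text):
    """Return every value that follows a 'Product name' style label in the text."""
    return [match.group("name") for match in LABEL_PATTERN.finditer(text)]


# Invented stand-in for text extracted from a PDF.
sample_text = """Data sheet
Product name: Acme TurboVac 3000
Weight: 2.4 kg
"""

print(find_product_names(sample_text))
```

If the product name is not preceded by a consistent label, a rule like this will miss it, and a trained model becomes the more sensible route.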

lsimmons
  • Is "Phrase extraction" the right term for my plans? Or is there a better keyword? – Helyon Nov 26 '19 at 17:42
  • @Helyon not entirely sure what most people would call extracting a product name from text - seems to me that "entity extraction" might be a better term for this. – lsimmons Dec 01 '19 at 18:27