Dermatology A dermatologist is a physician with training and expertise in the diagnosis and medical/surgical management of diseases of the skin, hair and nails, and mucous membranes.

Gynecology An obstetrician/gynecologist focuses on the health of women before, during, and after pregnancy, diagnosing and treating conditions of the reproductive system and associated disorders.

Assume the information above is stored in a PDF file. My job is to retrieve the specialty (Dermatology/Gynecology) by matching keywords such as "skin" or "pregnancy" in the text. Any suggestions? Thanks.

SDS
  • Welcome to Stack Overflow. What have you tried? Googling PDF parsing in Python returns many results...[How to Ask](https://stackoverflow.com/help/how-to-ask) – Elletlar Aug 01 '18 at 09:36
  • Welcome to [Stack Overflow!](https://stackoverflow.com) Please take the [tour](https://stackoverflow.com/tour), have a look around, and read [How do I ask a good question?](https://stackoverflow.com/help/how-to-ask) and provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). – Cezar Cobuz Aug 01 '18 at 09:36
  • Yes, I tried that. Most of the suggestions were to use PDFMiner, but I am not sure whether PDFMiner supports Python 3.x. Thanks for your response. – SDS Aug 01 '18 at 09:50

1 Answer

Create two lists of words, one specific to Dermatology and one specific to Gynecology, like this:

dermatology_list = ["skin", "hair", "nails"]
gynecology_list = ["women", "pregnancy", "reproductive"]

Start with score = 0 and parse the PDF file one word at a time: if the word is in dermatology_list, score += 1; if the word is in gynecology_list, score -= 1. At the end, if the score is positive your answer is Dermatology, and if the score is negative the answer is Gynecology. This answer should help you with parsing the PDF and extracting the data.
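The scoring described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the PDF text has already been extracted into a plain string with whatever PDF library you choose. The function name `classify_specialty`, the lower-casing of words, and returning `None` on a tie are my own assumptions and are not part of the original description.

```python
import re

dermatology_list = ["skin", "hair", "nails"]
gynecology_list = ["women", "pregnancy", "reproductive"]


def classify_specialty(text):
    """Score already-extracted PDF text against the two keyword lists.

    Hypothetical helper: `text` is assumed to be a plain string
    produced by a PDF text extractor of your choice.
    """
    score = 0
    # Lower-case the text and split it into alphabetic words so that
    # "Skin," and "skin" are counted the same way (an assumption).
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in dermatology_list:
            score += 1
        elif word in gynecology_list:
            score -= 1

    if score > 0:
        return "Dermatology"
    if score < 0:
        return "Gynecology"
    # Tie or no keywords found: the approach above does not say what
    # to do here, so this sketch returns None.
    return None


sample = "diagnosis and management of diseases of the skin, hair and nails"
print(classify_specialty(sample))
```

Exact word matching means "nail" or "pregnant" would not count unless you add them to the lists, so in practice the lists probably need to be longer than the three words shown here.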

Cezar Cobuz