
I want to extract numbers from a PDF file. I want to create a histogram of the scores of students who were admitted to a university; these scores are stored in a PDF file. What are some ways I can extract them?

AtilioA

1 Answer


You first need a PDF parser, since Python's standard library cannot read PDF files. An SO answer to "Python module for converting PDF to text" suggests using PDFMiner for this: http://www.unixuser.org/~euske/python/pdfminer/index.html
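As a rough sketch of that first step, the function below pulls the text out of a PDF. It assumes the maintained fork pdfminer.six is installed (`pip install pdfminer.six`), which provides `pdfminer.high_level.extract_text`; the original PDFMiner linked above exposes a different, lower-level interface. The function name `pdf_to_text` and the file name in the usage comment are placeholders.

```python
def pdf_to_text(path):
    """Return the text of the PDF at `path` as a single string.

    Assumes pdfminer.six is installed; the import is done lazily so that
    merely defining this function does not require the package.
    """
    from pdfminer.high_level import extract_text
    return extract_text(path)


# Usage (the file name is a placeholder, not taken from the question):
#     text = pdf_to_text("scores.pdf")
#     print(text[:500])  # inspect the start to see how the numbers are laid out
```

Looking at the raw text first is worthwhile, because the way the PDF lays out its lines determines what pattern you need afterwards.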

However, you've not provided any examples of how the numbers are represented. You need to write a custom line parser that uses regex patterns to define the rules for extracting these numbers. The difficulty mainly depends on whether the PDF contains only raw statistical data. If it does not, you also need to be careful not to pick up every number, since some of them appear in ordinary sentences and do not refer to any statistical data.

You can learn more about regular expressions in Python from the official documentation: https://docs.python.org/3/library/re.html

If regex is new to you, you can learn and experiment with it at http://regexr.com/.
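To make the line-parser idea concrete, here is a minimal sketch. It assumes the scores are written as three digits, a decimal point, and at least one more digit, and that valid scores lie between 600 and 900; both the pattern and the sample text are illustrative guesses, so adjust them to whatever your PDF actually contains. The name `extract_scores` is hypothetical.

```python
import re

# Exactly three digits, a dot, then one or more digits, on word boundaries,
# so "1000.0", "45.5" and plain integers such as "2017" are not matched.
SCORE_PATTERN = re.compile(r"\b\d{3}\.\d+\b")


def extract_scores(text, low=600.0, high=900.0):
    """Return every decimal number in `text` that lies within [low, high]."""
    candidates = (float(match) for match in SCORE_PATTERN.findall(text))
    # The range check drops look-alike numbers that cannot be scores.
    return [value for value in candidates if low <= value <= high]


# Made-up sample text standing in for the text extracted from the PDF.
sample = """Results 2017, page 3 of 12
Candidate A 731.9
Candidate B 655.4
Candidate C 701.8
Reference code 123.4; the maximum possible score is 1000.0.
About 45.5 percent of the 1200 applicants were approved."""

scores = extract_scores(sample)
print(scores)
```

The resulting list can be passed directly to a plotting call such as matplotlib's `plt.hist(scores)` to draw the histogram.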

Jointts
  • OK, thanks. The numbers are floats like 731.9, 655.4, 701.8, etc. They range from 600 to 900. Is there really anything more about them that should be noted? I mentioned that in my question because the PDF doesn't contain only data. – AtilioA May 10 '17 at 10:21
  • What you can do is paste the contents of the PDF into http://regexr.com/ and start building your regex from there. Experiment with it: for example, change [A-Z] to [600-900] and see what happens. The matching text will be highlighted. This pattern won't do exactly what you want, but you need to work out the right pattern yourself; I'm just pointing you in the right direction. Finally, once you have extracted everything you wanted, you can put the expression into Python and start processing the data. Writing regular expressions can be difficult at the start, but don't give up :) – Jointts May 10 '17 at 10:31
  • I used regex and it worked quite well; luckily I got the numbers I wanted without much trouble. Then I created [a histogram](https://i.imgur.com/tY7BN1e.png) with the output. – AtilioA May 10 '17 at 19:38