I wanted to know if it is possible to write a Python script to go through tax documents: basically, check whose tax document each one is and output that to a text file. I have a lot of documents to go through and have to see who sent each one in, so I wanted to see if a Python script could go through them all and collect the necessary information.

Edit: What would be the best way of achieving this?

yibs
  • Yes it is possible. – Max May 23 '18 at 19:52
  • Please see the similar SO Question here: [https://stackoverflow.com/questions/6413441/python-pdf-library](https://stackoverflow.com/questions/6413441/python-pdf-library) – Rohlex32 May 23 '18 at 19:56

1 Answer

Absolutely. There is a great tutorial on parsing PDFs here: https://automatetheboringstuff.com/chapter13/

Here is a code example that may work:

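import PyPDF2

# Open the PDF in binary mode and create a reader for it
pdfFileObj = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)  # number of pages in the document

# Extract the text of the first page (page numbers start at 0)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

# PyPDF2 3.0+ renames these calls: PdfReader, len(reader.pages), reader.pages[0], extract_text()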
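
Since the question is about many documents, the snippet above could be extended along these lines. This is only a sketch: the helper names `extract_pdf_text` and `collect_texts` are made up for illustration, and it uses the same PyPDF2 calls as the snippet above, so it assumes a PyPDF2 release older than 3.0. It also assumes the PDFs contain real text; scanned images would come back empty and would need OCR instead.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of every page of one PDF as a single string."""
    import PyPDF2  # imported here so the rest of the module loads without it

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        # Read all pages while the file is still open
        pages = [reader.getPage(i).extractText() for i in range(reader.numPages)]
    return "\n".join(pages)


def collect_texts(folder, extractor=extract_pdf_text):
    """Map each PDF file name in `folder` to its extracted text."""
    return {path.name: extractor(path) for path in sorted(Path(folder).glob("*.pdf"))}
```

Calling `collect_texts("tax_documents")` (a hypothetical folder name) would then give one text string per PDF, ready to be searched. The `extractor` argument is there so a different extraction function can be swapped in without touching the loop.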

You can then use regular expressions (the re module) to look through the extracted text and find what you want. A great tutorial is located here: https://automatetheboringstuff.com/chapter7/
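
As a rough illustration of that idea, the sketch below pulls out the word that follows a known label. The helper name `word_after` and the sample string are made up; real extracted text will need a pattern adjusted to its actual wording.

```python
import re


def word_after(label, text):
    """Return the first word that follows `label` in `text`, or None."""
    # re.escape keeps any special characters in the label literal
    match = re.search(re.escape(label) + r"\s*(\w+)", text)
    return match.group(1) if match else None


# Made-up sample standing in for text extracted from a PDF page
sample = "FIRST NAME: John LAST NAME: Doe"
first = word_after("FIRST NAME:", sample)
last = word_after("LAST NAME:", sample)
print(first, last)
```

Returning None when the label is missing makes it easy to spot documents that do not follow the expected layout.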

You should really go through all of automatetheboringstuff.com for basic work automation.

  • Could regex be used if each document has a different name and I need to extract the person's name? – yibs May 23 '18 at 20:04
  • Yes. Regex can identify a name after a specific substring. I don't know what the document says, but if it is something like FIRST NAME: John LAST NAME: Doe; you can use regular expressions to identify the word after 'FIRST NAME:' and the word after 'LAST NAME:' – never_comment May 23 '18 at 20:21
  • It's more of a tax document, so after the "business name" label it gives the business name (which I need), then the address on a new line (which I don't need). I'm assuming I have to search for whatever comes after "business name" up until a newline, right? – yibs May 23 '18 at 20:28
  • You really need to do the first step of parsing the PDF into a text string. Then look at the text string to figure out what to extract with regular expressions. I guarantee it can be done without too much effort. – never_comment May 23 '18 at 20:36
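
The approach discussed in these comments could be sketched roughly as follows: capture whatever follows the "business name" label up to the end of that line, then write one result per document to a text file. The exact label wording, the helper names, and the sample strings are all assumptions, so the pattern will likely need adjusting once the real extracted text has been inspected.

```python
import re
from pathlib import Path

# Assumed label; adjust to the wording that actually appears in the extracted text.
# [:\s]* skips a colon, spaces, or a line break between the label and the name.
BUSINESS_NAME = re.compile(r"business name[:\s]*([^\n]+)", re.IGNORECASE)


def find_business_name(text):
    """Return the text after 'business name' up to the end of that line, or None."""
    match = BUSINESS_NAME.search(text)
    return match.group(1).strip() if match else None


def write_report(texts, out_path):
    """Write one 'filename<TAB>business name' line per document."""
    lines = []
    for filename, text in sorted(texts.items()):
        name = find_business_name(text) or "NOT FOUND"
        lines.append(f"{filename}\t{name}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Made-up strings standing in for text extracted from two PDFs
texts = {
    "doc1.pdf": "Business name: Acme Widgets LLC\n123 Main St\nSpringfield",
    "doc2.pdf": "Form header\nBusiness name\nBlue Sky Bakery\n9 Elm Rd",
}
```

Something like `write_report(texts, "business_names.txt")` (a hypothetical output file name) would then produce the text file asked about in the question. Documents where the label is not found are marked NOT FOUND so they can be checked by hand.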