
When I read a PDF file that contains tabular data using the code below, the extracted text has no spaces between columns or rows.

import PyPDF2

pdfFileObj = open('filename.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages                # number of pages in the document
pageObj = pdfReader.getPage(0)    # first page
pageObj.extractText()

The output looks like this:

'Page 1 of 1MINISTRY OF CORPORATE AFFAIRSRECEIPTG.A.R.7SRN :U16571275Payment 
made into :Service Request Date :03/08/2017Received From :'

I expect a space after "Page 1 of 1", after "G.A.R.7", and between "U16571275" and "Payment".

Aditya Rao
  • This question seems to be a duplicate; see [this](https://stackoverflow.com/a/48458469/5566361) answer. I was facing the same problem of missing white spaces, and it helped me solve it. – Mohsin Ashraf Nov 09 '20 at 07:41

1 Answer

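The extractText() method returns the page's text as a string, and the extraction is not always perfect. In particular, spacing that a PDF expresses through text positioning rather than actual space characters can be lost.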

If you are trying to read a PDF file in Python, you can try the textract module as an alternative: http://textract.readthedocs.io/en/stable/index.html

pip install textract

Once it is installed:

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
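
Note that textract.process returns bytes rather than str, so the result usually needs decoding before further use. The following is a minimal sketch of that step, not part of textract itself: the helper name clean_extracted_text is hypothetical, UTF-8 is assumed as the output encoding, and the sample bytes are made up for illustration rather than taken from a real PDF.

```python
import re


def clean_extracted_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode extractor output and tidy its whitespace (hypothetical helper)."""
    text = raw.decode(encoding, errors="replace")
    # Collapse runs of spaces and tabs, but keep line breaks so rows stay separate.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Drop lines that are empty after stripping.
    return "\n".join(line for line in lines if line)


# Assumed usage with the call above:
#   text = clean_extracted_text(textract.process('path/to/pdf/file', method='pdfminer'))
# Made-up sample bytes standing in for extractor output:
sample = b"SRN :   U16571275\n\n  Payment made into :\t Service Request Date : 03/08/2017 \n"
print(clean_extracted_text(sample))
```

Keeping the line breaks while collapsing only horizontal whitespace is a deliberate choice here, since the rows of a table are often the easiest structure to recover from extracted text.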
vanishka