0

I have to convert a whole PDF to text. I have seen many examples of converting a PDF to text, but only for a particular page.

    from PyPDF2 import PdfFileReader
    import os

    def text_extractor(path):
        with open(os.path.join(path, file), 'rb') as f:
            pdf = PdfFileReader(f)
            # Here I can specify a page, but I need to convert the whole PDF without specifying pages
            page = pdf.getPage(0)
            text = page.extractText()
            print(text)

    if __name__ == '__main__':
        path = "C:\\Users\\AAAA\\Desktop\\BB"
        for file in os.listdir(path):
            if not file.endswith(".pdf"):
                continue
            text_extractor(path)

How can I convert a whole PDF file to text without using getPage()?

saicharan
  • Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) – rdmolony May 10 '21 at 16:10

6 Answers

3

You may want to use textract as this answer recommends to get the full document if all you want is the text.

If you want to use PyPDF2, you can first get the number of pages and then iterate over each page, like this:

    from PyPDF2 import PdfFileReader
    import os

    def text_extractor(pdf_path):
        with open(pdf_path, 'rb') as f:
            pdf = PdfFileReader(f)
            text = ""
            # Walk every page instead of asking for a single page number
            for page_num in range(pdf.getNumPages()):
                page = pdf.getPage(page_num)
                text += page.extractText()
            print(text)

    if __name__ == '__main__':
        path = "C:\\Users\\AAAA\\Desktop\\BB"
        for file in os.listdir(path):
            if not file.endswith(".pdf"):
                continue
            # Pass the full file path so the function does not rely on a global
            text_extractor(os.path.join(path, file))

You may, however, want to remember which page the text came from, in which case you could use a list:

page_text = []
for page_num in range(pdf.getNumPages()):  # For each page
    page = pdf.getPage(page_num)  # Get that page's reference
    page_text.append(page.extractText())  # Add that page's text to our list
for page in page_text:
    print(page)  # Print each page
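
Once the per-page strings are in a list, keeping track of where each piece came from needs nothing from PyPDF2. A minimal sketch, assuming `page_texts` is a plain list of strings such as `page_text` above; `label_pages` and `join_pages` are hypothetical helper names, not part of any library:

```python
def label_pages(page_texts, start=1):
    """Pair each page's text with its page number (1-based by default)."""
    return [(number, text) for number, text in enumerate(page_texts, start=start)]


def join_pages(page_texts, separator="\n\n"):
    """Join all pages into one string, with a marker line before each page."""
    parts = [
        f"--- page {number} ---\n{text}"
        for number, text in label_pages(page_texts)
    ]
    return separator.join(parts)
```

Calling `join_pages(page_text)` would then give one string for the whole document while still showing where each page starts.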
1

You could use tika to accomplish this task, but the output needs a little cleaning.

from tika import parser

parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print(parse_entire_pdf)
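
What "cleaning" means will depend on the document. With `xmlContent=True` the content comes back as XHTML, so one plausible cleanup is to drop the tags and the blank lines. A minimal sketch using only the standard library; `TextOnly` and `strip_markup` are hypothetical names, and the assumption that tag stripping is the cleanup you want is mine, not the answer's:

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(xhtml):
    """Return the text of an XHTML string without tags or blank lines."""
    parser = TextOnly()
    parser.feed(xhtml)
    parser.close()  # flush any text still buffered by the parser
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

It could be applied to the string printed above, for example `strip_markup(parse_entire_pdf)`.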

This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.

from PyPDF2 import PdfFileReader

def pdf_text_extractor(path):
    # The file must stay open while pages are read, so all of the
    # work happens inside the with block.
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # Get total pdf page number.
        totalPageNumber = pdf.numPages

        currentPageNumber = 0

        while currentPageNumber < totalPageNumber:
            page = pdf.getPage(currentPageNumber)

            text = page.extractText()
            # The encoding puts each page on a single line.
            # type is <class 'bytes'>
            print(text.encode('utf-8'))

            #################################
            # This outputs the text to a list,
            # but it doesn't keep paragraphs
            # together
            #################################
            # output = text.encode('utf-8')
            # split = str(output, 'utf-8').split('\n')
            # print(split)
            #################################

            # Process next page.
            currentPageNumber += 1

path = 'mypdf.pdf'
pdf_text_extractor(path)
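
The commented-out split above breaks the text at every line break, which is why paragraphs come apart. If the extracted text happens to separate paragraphs with blank lines (an assumption; extracted PDF text often does not), a sketch like the following could regroup it. `split_paragraphs` is a hypothetical helper name:

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and rejoin wrapped lines inside each paragraph."""
    # A run of one or more blank (or whitespace-only) lines ends a paragraph.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Collapse the line breaks inside each chunk into single spaces.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

When the PDF gives no blank lines between paragraphs, this returns the whole page as a single paragraph, so it is only a starting point.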
Life is complex
1

Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":

from pdfreader import SimplePDFViewer, PageDoesNotExist

your_pdf_file_name = "mypdf.pdf"  # path to your PDF
fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""

try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass

Maksym Polshcha
0

PDF is a page-oriented format, and therefore you'll need to deal with the concept of pages.

What makes it perhaps even more difficult is that you're not guaranteed that the text you extract comes out in the same order as it is presented on the page. PDF allows one to say "put this text within a 4x3 box situated 1" from the top, with a 1" left margin", and then put the next piece of text somewhere else on the same page.

extractText() simply returns the extracted text blocks in document order, not presentation order.

Tables are notoriously difficult to extract in a common, meaningful way... You see them as tables, PDF sees them as text blocks placed on the page with little or no relationship.

Still, getPage() and extractText() are good starting points, and if you have simply formatted pages, they may work fine.
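
To illustrate the difference between document order and presentation order, here is a minimal sketch that reorders positioned text blocks top-to-bottom and left-to-right. It assumes you already have coordinates for each block from some other tool (extractText() does not give them) and that `y` is measured from the top of the page; `TextBlock`, `reading_order`, and `line_tolerance` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page (an assumption)
    text: str


def reading_order(blocks, line_tolerance=2.0):
    """Sort blocks top-to-bottom, then left-to-right within each line."""
    ordered = sorted(blocks, key=lambda b: (b.y, b.x))
    lines = []
    for block in ordered:
        # Blocks whose y is within the tolerance of the line's first block
        # are treated as belonging to the same line.
        if lines and abs(block.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    return [b for line in lines for b in sorted(line, key=lambda b: b.x)]
```

This handles simple single-column pages only; multi-column layouts and tables would need more than a sort.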

pbuck
0

I found a very simple way to do this.

You have to follow these steps:

  1. Install PyPDF2: If you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission to do this).

    pip install PyPDF2

If you're not using Anaconda, you have to install pip and add it to your PATH so that you can run it from cmd or a terminal.

  2. Python code: The following code shows how to convert a PDF file very easily:

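    import PyPDF2

    with open("pdf file path here", 'rb') as file_obj:
        pdf_reader = PyPDF2.PdfFileReader(file_obj)
        # getPage(0) reads only the first page; loop over
        # range(pdf_reader.getNumPages()) to cover the whole file
        raw = pdf_reader.getPage(0).extractText()

    print(raw)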
    
0

I used the pdftotext module to get this done easily.

import pdftotext

# Load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Write every page of the PDF to a text file
with open("test.txt", "w") as out:
    for page in pdf:
        out.write(page)

Link: https://github.com/manojitballav/pdf-text
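
The question loops over a whole folder, so the same idea can be applied to each PDF in turn. A minimal sketch, assuming the extraction step is passed in as a function so that any of the libraries on this page can be plugged in; `convert_folder` and its `extract` parameter are hypothetical names:

```python
from pathlib import Path


def convert_folder(folder, extract):
    """Write a .txt file next to every .pdf in folder and return the new paths.

    extract is any callable that takes a PDF path and returns its text.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        txt_path.write_text(extract(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

With pdftotext, `extract` could be a small function that opens the path in binary mode and returns `"\n\n".join(pdftotext.PDF(f))`.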

Mono