I am extracting text from a PDF file using the PyPDF2 package. I get output, but not in the desired form, and I cannot find where the problem is.

The code snippet is as follows:

import PyPDF2
def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('F:\\Pen Drive 8 GB\\PDF\\Handbooks\\book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while (startPage <= endPage):
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(3, 3)

The output I currently get is attached for reference:

[image: screenshot of the output]

Any help is highly appreciated.

M S
  • Possible duplicate of [Extracting text from a PDF file using Python](https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python) – bigbounty Aug 27 '18 at 14:45

1 Answer

This line, `cleanText += myWord`, just concatenates everything into one long string. If you want to filter out '\n', then instead of:

    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()

you can do this:

text = [w for w in text if w != '\n']
Amitay Dror
  • Thanks. But after incorporating your suggestion, the output is displayed as: ['B', 'e', 'n', 'j', 'a', 'm', 'i', 'n', 'W', 'e', 'y', 'e', 'r', 's', 'J', 'u', 'd', 'y', 'B', 'o', 'w', 'e', 'n', 'A', 'l', 'a', 'n', 'D', 'i', 'x', 'P', 'h', 'i', 'l', 'i', 'p', 'p', 'e', 'P', 'a', 'l', 'a', 'n', 'q', 'u', 'e', 'E .....], somewhat like this. Every letter of every word ends up as a separate item in the list. – M S Aug 27 '18 at 15:39
  • I want to display the split words as a list. – M S Aug 27 '18 at 16:18
  • Oh, there is another thing I hadn't noticed. When you add to the text here, `text += pageObj.extractText()`, you are adding a string and not a list of strings. Use `text += [pageObj.extractText()]` and it will be a list of strings; then I think you should get the result you expect. – Amitay Dror Aug 28 '18 at 07:10
  • That is because the `+=` operator on lists expects a list (or another iterable) on the other side. So if you give it a string, the string is treated as a sequence of characters, and each character is added as an individual item. – Amitay Dror Aug 28 '18 at 07:30
  • Thanks. But, to the best of my knowledge, a list object has no attribute named `extractText()` – M S Aug 28 '18 at 18:06
  • I was referring to a line from your code: the second line in the while loop. Wrap the right-hand side in `[` and `]` for it to work as you expect. – Amitay Dror Aug 29 '18 at 05:23
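
The `+=` behaviour discussed in the comments can be reproduced without a PDF. The sketch below is only an illustration: the sample strings and the helper name `words_from_pages` are made up, and it assumes the page text arrives as ordinary strings with words separated by newlines, as the output in the comments suggests. With the bracketed form, the list holds one string per page, so each page still has to be split into words.

```python
# list += str extends the list one character at a time.
chars = []
chars += "Judy\nBowen"
print(chars)   # individual characters, including '\n'

# list += [str] adds the whole string as a single item (same as append).
pages = []
pages += ["Judy\nBowen"]
print(pages)   # one item per page


def words_from_pages(page_texts):
    # Hypothetical helper: str.split() with no arguments splits on any
    # whitespace, newlines included, so no separate newline filter is needed.
    words = []
    for page_text in page_texts:
        words.extend(page_text.split())
    return words


# Made-up sample standing in for the strings returned by extractText().
sample_pages = ["Benjamin\nWeyers\nJudy\nBowen\n", "Alan\nDix\n"]
print(words_from_pages(sample_pages))
```

In the question's code, the same idea would amount to collecting each page's extracted string in a list inside the while loop and splitting afterwards, instead of removing the '\n' characters, which is what glues neighbouring words together.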