PDF - Split Single Words into Individual Lines - Python 3

Question

I am trying to print each word of a PDF on its own line, but so far I can only do this with text files, as shown below.

Moreover, I am not allowed to convert the PDF files to TXT first and then perform this operation. It must be done on the PDF files directly.

with open('filename.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

If filename.txt contains just "Hello World!", then this code prints:

Hello
World!

I need to do the same with searchable PDF files as well. Any help would be appreciated.

https://en.wikipedia.org/wiki/Pdftotext – Arkadiusz Drabczyk Dec 05 '19 at 19:21

score 1 · Answer 1 · answered Dec 05 '19 at 20:03

For the PDF, you should use pdfminer (pdfminer.six on Python 3) or PyPDF2.

Here is a good article you can use to extract the text, and then you can use Anilkumar's method to extract line by line.

https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
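A minimal sketch of that approach, assuming pdfminer.six is installed and that reading the text into memory (no intermediate .txt file is written) is acceptable under the question's rule. It uses extract_text from pdfminer.high_level. The names split_words and words_from_pdf are made up for illustration.

```python
def split_words(text):
    """Return the whitespace-separated words of text, in order."""
    return text.split()


def words_from_pdf(path):
    """Extract the text of a searchable PDF and return its words."""
    # Imported here so split_words works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # extract_text returns the document text as one in-memory string.
    return split_words(extract_text(path))
```

Printing one word per line is then `for word in words_from_pdf("filename.pdf"): print(word)`, where filename.pdf is a placeholder path.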

score 1 · Accepted Answer · answered Dec 06 '19 at 11:22

Check out PyMuPDF. There is a lot you can do with it, including getting the text of a PDF line by line with page.getText() (spelled page.get_text() in newer PyMuPDF releases).
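A sketch of how that could look for single words, assuming a PyMuPDF release where the method is spelled page.get_text() (older releases used page.getText()). With the "words" option it returns one tuple per word, so no manual splitting is needed. The helper names words_only and pdf_words are hypothetical.

```python
def words_only(word_tuples):
    """Keep just the word string from each PyMuPDF word tuple.

    Each tuple has the form
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    return [w[4] for w in word_tuples]


def pdf_words(path):
    """Return every word of every page, in the order PyMuPDF reports them."""
    import fitz  # PyMuPDF; imported here so words_only works without it

    words = []
    with fitz.open(path) as doc:
        for page in doc:
            words.extend(words_only(page.get_text("words")))
    return words
```

The order is the one stored in the file, which can differ from the visual reading order on pages with complex layouts.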

willing_astronomer

score 0 · Answer 3 · answered Dec 06 '19 at 15:05

You can use pdfreader to extract text (both plain strings and text content that still contains PDF operators) from a PDF document.

Here is sample code that extracts both from every page of the document.

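from pdfreader import SimplePDFViewer, PageDoesNotExist

your_pdf_file_name = "example.pdf"  # placeholder: path to your PDF

plain_text = ""
pdf_markdown = ""
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()  # process the current page
            pdf_markdown += viewer.canvas.text_content  # text with PDF operators
            plain_text += "".join(viewer.canvas.strings)  # decoded strings only
            viewer.next()  # raises PageDoesNotExist after the last page
    except PageDoesNotExist:
        pass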

Note that text in a PDF is usually not stored as "words". It is stored as commands that tell a conforming PDF viewer where and how to place each glyph, which means a single word may be drawn by several commands. Read more on that in the PDF 1.7 specification, section 9 (Text).
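To illustrate the point about fragments: a word can arrive split across several strings, so joining the strings first and splitting on whitespace afterwards gives back whole words. This is only a sketch. It assumes spaces are present as characters in the extracted strings, which is not true for every PDF. The fragment list is invented, and words_from_fragments is a hypothetical helper.

```python
def words_from_fragments(fragments):
    """Join text fragments as-is, then split the result on whitespace."""
    return "".join(fragments).split()


# Invented fragments, shaped like what viewer.canvas.strings might hold
# for a page where one word is drawn by several commands.
fragments = ["Hel", "lo Wo", "rld!"]
for word in words_from_fragments(fragments):
    print(word)
```

Splitting each fragment separately would instead yield pieces such as "Hel" and "lo", which is why the join comes first.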

score -1 · Answer 4 · Anilkumar · 2019-12-05T22:03:13.800

When I saw filename.txt I got confused.

Since you are working with PDFs, the link below might be helpful. See if it helps:

How to use PDFminer.six with python 3?

answered Dec 05 '19 at 19:48 · edited Dec 05 '19 at 22:03

This inserts individual characters into new lines for text files. I need to separate words from other words and insert them into new lines for PDF files. – Starbucks Dec 05 '19 at 20:02
