How to extract a PDF's text using pdfrw

Question

Can pdfrw extract the text out of a document?

I was thinking something along the lines of

from pdfrw import PdfReader
doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something

maxTwo · Answer 1 · 2018-04-22T09:30:27.563

2

In the docs they explain how to extract the text. However, what you get is just a bytestream. You could iterate over the pages and decode them individually.

import zlib
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream  # a str holding byte values, not a bytes object
    # Decoding depends on the stream's filters; zlib only handles /FlateDecode.
    data = zlib.decompress(bytestream.encode("latin1"))
    # do something with the decoded data

Edit: It may be worth noting that pdfrw does not yet support text decompression due to its complexity, according to the author.
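
As a minimal sketch of the decode step, assuming the stream uses only the FlateDecode filter and arrives as a str in which each character stands for one byte: the helper name decode_flate_stream is hypothetical, and the sample data is made up and compressed on the spot so the snippet runs without pdfrw or a PDF file.

```python
import zlib


def decode_flate_stream(stream: str) -> bytes:
    """Decompress a FlateDecode stream that was handed over as a str.

    Assumes every character of `stream` has a code point in 0-255,
    which is what makes the latin-1 round trip lossless.
    """
    raw = stream.encode("latin1")  # one character -> one byte
    return zlib.decompress(raw)


# Simulate a compressed content stream (made-up sample, not from a real PDF).
original = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET"
as_str = zlib.compress(original).decode("latin1")  # str-typed stream
decoded = decode_flate_stream(as_str)
print(decoded)
```

The result is still a sequence of drawing instructions, not plain text. Streams with other filters, or with several filters chained, would need different handling.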

edited Apr 22 '18 at 09:30

answered Apr 22 '18 at 09:18

maxTwo


`zlib.decompress(stream.encode("latin1"))` (although it's weird API that `stream` returns a string) – user202729 Feb 23 '21 at 15:16
Keep in mind that decoding the content stream is not the same as actually extracting the text. The content stream contains postscript operators that render the text. This means that even if you decode the content stream, you will still need to parse those instructions to figure out which glyphs are being rendered at which position. This solution does not really bring you closer to getting the text from the PDF. – Joris Schellekens Aug 03 '21 at 14:25

score 2 · Answer 2 · edited Jan 31 '19 at 15:43


It depends on which filters are applied to the page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.

Note: Pass the whole Contents object to the function, wrapped in a list.

Note: This is not the same as pdfrw.PdfReader.uncompress()

You then have to parse the resulting string to find your text. It sits in blocks between BT (begin text) and ET (end text) markers, inside round brackets on lines ending in either 'TJ' or 'Tj'.
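
A rough sketch of that parsing step, assuming the content stream is already decompressed and available as a str. The function name extract_literal_text and the sample stream are made up for illustration. It only collects literal strings in round brackets; it ignores fonts and encodings, hex strings, octal escapes, nested unescaped parentheses, and text positioning, so the output is not guaranteed to be readable text for a real PDF.

```python
import re
from typing import List

# Literal string: "(" ... ")" with backslash escapes. Nested unescaped
# parentheses are not handled in this sketch.
_LITERAL = r"\((?:\\.|[^\\()])*\)"
_BLOCK = re.compile(r"\bBT\b(.*?)\bET\b", re.S)
_SHOW = re.compile(
    rf"(?P<single>{_LITERAL})\s*Tj|\[(?P<array>.*?)\]\s*TJ", re.S
)


def _unescape(literal: str) -> str:
    """Strip the outer parentheses and resolve escaped (, ) and backslash."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_literal_text(content: str) -> List[str]:
    """Return one string per BT...ET block, in stream order."""
    blocks = []
    for block in _BLOCK.finditer(content):
        parts = []
        for show in _SHOW.finditer(block.group(1)):
            if show.group("single") is not None:
                parts.append(_unescape(show.group("single")))
            else:
                # A TJ array mixes strings with kerning numbers; keep the strings.
                for lit in re.findall(_LITERAL, show.group("array")):
                    parts.append(_unescape(lit))
        blocks.append("".join(parts))
    return blocks


# Made-up content stream with one Tj and one TJ operator.
sample = (
    "BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n"
    "BT /F1 12 Tf 72 700 Td [(Wor) -20 (ld \\(pdf\\))] TJ ET"
)
print(extract_literal_text(sample))
```

Blocks come back in the order they appear in the stream, which need not match the reading order on the page.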

edited Jan 31 '19 at 15:43

Michal


answered Jan 31 '19 at 10:56

Eddie


You are missing quite a few crucial steps. You would need to process the fonts in the PDF (the TJ operator can refer to glyph indices directly, and then you'd have no idea which unicode character is mapped to which glyph). You would also need to parse other instructions (for instance those manipulating the transformation matrix) to keep track of **where** text is being drawn, since there is no requirement that text in a PDF is rendered sequentially. – Joris Schellekens Aug 03 '21 at 14:28

score -2 · Answer 3 · edited Mar 17 '17 at 18:27


Here's an example that may be useful:

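import re
from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy PyPDF2 API, not pdfrw

pdfreader = PdfFileReader(pdf_path)
pdfwriter = PdfFileWriter()
number_of_pages = pdfreader.getNumPages()
cse_count = 0
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    # Keep only the pages whose extracted text contains the keyword
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)

Here extractText() returns the text of each page; pages whose text contains the keyword CSE are counted and added to the writer.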

edited Mar 17 '17 at 18:27

FOR


answered Mar 17 '17 at 14:36

kanishka makhija



The method extractText() is indeed from the package PyPDF2, as mentioned by Susovan Dey. This method is not available in pdfrw. – MightyCurious Mar 13 '18 at 13:45
