4

Can pdfrw extract the text out of a document?

I was thinking something along the lines of

from pdfrw import PdfReader
doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something
Roman

3 Answers

2

The docs explain how to get at the page content. However, what you get is just a byte stream, not text. You could iterate over the pages and decode each stream individually.

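import zlib
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
for page in doc.pages:
    # page.Contents.stream is a str holding raw byte values, not a bytes object
    bytestream = page.Contents.stream
    # Assumption: the stream uses a single FlateDecode filter;
    # other filters (or an array of content streams) need different handling
    string = zlib.decompress(bytestream.encode("latin1")).decode("latin1")
    # do something with that decoded content stream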

Edit: It may be worth noting that, according to the author, pdfrw does not yet support text decompression because of its complexity.
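
As a rough illustration of the decoding step, here is a minimal sketch that does not need a PDF file. It assumes, as a comment on this answer suggests, that the stream arrives as a str whose code points are the raw byte values (Latin-1) and that FlateDecode is the only filter. The helper name decode_flate_stream and the sample content are made up for the example; they are not part of pdfrw.

```python
import zlib


def decode_flate_stream(stream: str) -> str:
    """Inflate a FlateDecode stream that is held as a str of Latin-1 code points."""
    # Latin-1 maps code points 0-255 one-to-one onto bytes, so this recovers the raw data
    raw = stream.encode("latin1")
    return zlib.decompress(raw).decode("latin1")


# Simulate such a stream: compress some content, then view the bytes as a Latin-1 str
original = "BT /F1 12 Tf (Hello) Tj ET"
simulated_stream = zlib.compress(original.encode("latin1")).decode("latin1")

print(decode_flate_stream(simulated_stream))
```

The result is still a content stream full of drawing operators, not plain text.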

maxTwo
  • `zlib.decompress(stream.encode("latin1"))` (although it's a weird API that `stream` returns a string) – user202729 Feb 23 '21 at 15:16
  • Keep in mind that decoding the content stream is not the same as actually extracting the text. The content stream contains postscript operators that render the text. This means that even if you decode the content stream, you will still need to parse those instructions to figure out which glyphs are being rendered at which position. This solution does not really bring you closer to getting the text from the PDF. – Joris Schellekens Aug 03 '21 at 14:25
2

It depends on which filters are applied to page.Contents.stream. If the only filter is FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.

Note: Pass the whole Contents object to the function, wrapped in a list.

Note: This is not the same as pdfrw.PdfReader.uncompress()

You then have to parse the decoded string to find your text. It sits in blocks between the BT (begin text) and ET (end text) markers, inside round brackets on lines ending in either 'TJ' or 'Tj'.
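
The parsing step could be sketched roughly as below. This is a naive illustration and not a real PDF parser: it works on an already decoded content stream passed in as a plain string, the function name extract_literal_strings and the sample stream are made up for the example, and it ignores font encodings, hex strings, nested parentheses and text positioning, so the output may not be readable text for many PDFs.

```python
import re

# A text object: everything between the BT and ET operators
TEXT_OBJECT = re.compile(r"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string in round brackets, allowing backslash escapes (no nested brackets)
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")
# A show operation: either "[ ... ] TJ" or "( ... ) Tj"
SHOW_OP = re.compile(r"\[.*?\]\s*TJ|\((?:\\.|[^\\()])*\)\s*Tj", re.DOTALL)


def unescape(body: str) -> str:
    """Undo the escapes for brackets and backslashes; other escapes are left as-is."""
    return re.sub(r"\\([()\\])", r"\1", body)


def extract_literal_strings(content: str) -> list:
    """Return one string per Tj/TJ operation found inside BT ... ET blocks."""
    chunks = []
    for text_object in TEXT_OBJECT.finditer(content):
        for show in SHOW_OP.finditer(text_object.group(1)):
            parts = LITERAL.findall(show.group(0))
            # Strip the surrounding brackets and join the pieces of a TJ array
            chunks.append("".join(unescape(p[1:-1]) for p in parts))
    return chunks


sample = "BT\n/F1 12 Tf\n72 712 Td\n(Hello) Tj\n[(Wor) -20 (ld)] TJ\nET\n"
print(extract_literal_strings(sample))
```

The numbers inside a TJ array are kerning adjustments, which is why the sketch simply drops them and concatenates the string pieces.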

Michal
Eddie
  • You are missing quite a few crucial steps. You would need to process the fonts in the PDF (the TJ operator can refer to glyph indices directly, and then you'd have no idea which unicode character is mapped to which glyph). You would also need to parse other instructions (for instance those instructions manipulating the transformation matrix) to keep track of **where** text is being drawn. Since there is no requirement that text in a PDF is rendered sequentially. – Joris Schellekens Aug 03 '21 at 14:28
-2

Here's an example that may be useful:

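import re
from PyPDF2 import PdfFileReader, PdfFileWriter  # PyPDF2 (pre-3.0 API), not pdfrw

# Assumed setup; the original snippet did not show how these were created
pdfreader = PdfFileReader(pdf_path)
pdfwriter = PdfFileWriter()
number_of_pages = pdfreader.getNumPages()
cse_count = 0

for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    # Keep only the pages whose extracted text mentions the keyword
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)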

Here extractText() returns the text of each page; pages whose text contains the keyword CSE are counted and added to the writer.

FOR
kanishka makhija
  • 3
    The method extractText() is indeed from the package PyPDF2, as mentioned by Susovan Dey. This method is not available in pdfrw. – MightyCurious Mar 13 '18 at 13:45