I need to extract text from a PDF using Python, but pdfminer and similar libraries are too big to use. When I use the simple "with open xxx as xxx" method, I run into a problem: the content part isn't extracted properly. The text looks like bytes because it starts with b'. My code and a screenshot of the result:

with open(r"C:\Users\admin\Desktop\aaa.pdf", "rb") as file:
    # Binary mode yields bytes objects, so every line prints as b'...'
    for line in file:
        print(line)

Output screenshot: (image not included)

mkl
  • You still have to decode the PDF. – bherbruck Jan 01 '21 at 03:53
  • Could you be more specific? How do I convert the code? I have tried many ways, but none of them worked. Thanks a lot @TenaciousB – linsen1983 Jan 01 '21 at 04:00
  • I don't think you can without a PDF reading library like the ones you mentioned. – bherbruck Jan 01 '21 at 04:23
  • Please share your code and output as text instead of an image; use an image only if sharing it as text is not possible. – Chandan Jan 01 '21 at 05:31
  • PDFs are not plaintext. That's why libraries exist to decode them. `open` is equipped to read/write bytes from a file, not to perform any actual decoding or encoding of the data it sees. – Silvio Mayolo Jan 01 '21 at 07:24
  • *"pdfminer and others are too big to use"* - have you considered that they are so big for a reason? Essentially, they are so big because you need that much code for adequate text extraction, in particular to *extract Chinese text*. For simple PDFs with English text there are some shortcuts that work in benign circumstances, but for CJK text you should not expect such shortcuts. – mkl Jan 01 '21 at 09:27
  • If you want to try to implement text extraction yourself, grab a copy of ISO 32000-1 or ISO 32000-2 (Google for pdf32000 for a free copy of the former) and study that PDF specification. Based on that information you can learn, step by step, to parse those binary strings into PDF objects, find the content streams therein, parse the instructions in those content streams, and retrieve the text drawn by those instructions. – mkl Jan 01 '21 at 09:34

1 Answer

To generate an answer from the comments...

When I use the simple "with open xxx as xxx" method, I run into a problem: the content part isn't extracted properly

The reason is that PDF is not a plain text format but a binary format whose contents may be compressed and/or encrypted. For example, the object you posted a screenshot of,

4 0 obj
<</Filter/FlateDecode/Length 210>>
stream
...
endstream
endobj

contains Flate-compressed data between stream and endstream (as indicated by the Filter value FlateDecode).
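To illustrate what "compressed" means here, the following is a minimal Python sketch that inflates one such stream using only the standard library's `zlib`. The object is built in memory with made-up content (it is not the data from the screenshot), and the helper name `inflate_stream` is hypothetical. It assumes that `/Length` is a direct number, that `/FlateDecode` is the only filter, and that the file is not encrypted; none of that is guaranteed in real PDFs.

```python
import re
import zlib

# Stand-in for the object in the question. The stream content is invented
# for illustration; it is not taken from the asker's file.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
compressed = zlib.compress(content)
pdf_object = (
    b"4 0 obj\n<</Filter/FlateDecode/Length "
    + str(len(compressed)).encode("ascii")
    + b">>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)


def inflate_stream(obj_bytes: bytes) -> bytes:
    """Return the decompressed payload of a single FlateDecode stream object."""
    # Take the payload size from the dictionary instead of searching for
    # "endstream", because compressed bytes can contain any value.
    # Assumption: /Length is a direct number, not a reference like "5 0 R".
    length = int(re.search(rb"/Length\s+(\d+)", obj_bytes).group(1))
    # The payload starts right after the "stream" keyword and its line break.
    start = re.search(rb"stream\r?\n", obj_bytes).end()
    return zlib.decompress(obj_bytes[start:start + length])


decoded = inflate_stream(pdf_object)
print(decoded)
```

What comes out is still not readable text but a sequence of PDF drawing instructions, which is where the next problem starts.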

But even if it were not compressed or encrypted, you might still not recognize any of the displayed text, because each PDF font object can use its own, completely custom encoding. Furthermore, glyphs you see grouped in a text line do not need to be drawn by the same drawing instruction in the PDF; you may have to arrange all the strings in the drawing instructions by coordinate to be able to find the text of a text line.
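Both effects can be shown with a toy Python example. Everything in it is invented for illustration: the `pieces` list stands in for strings already pulled out of drawing instructions together with their positions, and `font_map` stands in for the code-to-character mapping a real font's encoding or ToUnicode map would supply. Grouping by exact y value is a simplification; real extractors have to tolerate small offsets, rotation, and varying font sizes.

```python
# Hypothetical text pieces as (x, y, glyph codes), in drawing order.
pieces = [
    (130.0, 700.0, b"\x03\x04"),  # drawn first, but sits to the right
    (72.0, 700.0, b"\x01\x02"),
    (72.0, 720.0, b"\x02\x01"),
]

# Hypothetical custom encoding of one font: glyph code -> character.
font_map = {1: "P", 2: "D", 3: "F", 4: "!"}


def decode(codes: bytes) -> str:
    """Translate glyph codes to characters; unknown codes become U+FFFD."""
    return "".join(font_map.get(code, "\ufffd") for code in codes)


def assemble(text_pieces):
    """Group pieces into lines by y, then order each line by x."""
    lines = {}
    for x, y, codes in text_pieces:
        lines.setdefault(y, []).append((x, decode(codes)))
    # The PDF y axis points upwards, so a larger y is higher on the page.
    return [
        "".join(text for _, text in sorted(parts))
        for _, parts in sorted(lines.items(), reverse=True)
    ]


print(assemble(pieces))  # ['DP', 'PDF!']
```

Without `font_map` the raw bytes `\x01\x02` carry no recognizable text at all, and without the sort the second line would read "F!PD".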

(For more details and background, read this answer, which focuses on the related topic of replacing text in a PDF.)

Thus, when you say

pdfminer and others are too big to use

please consider that they are so big for a reason: you need that much code for adequate text extraction. This is particularly true for Chinese text; for simple PDFs with English text there are some shortcuts that work in benign circumstances, but for PDFs with CJK text you should not expect such shortcuts.

If you nonetheless want to try to implement text extraction yourself, grab a copy of ISO 32000-1 or ISO 32000-2 (Google for pdf32000 for a free copy of the former) and study that PDF specification. Based on that information you can learn, step by step, to parse those binary strings into PDF objects, find the content streams therein, parse the instructions in those content streams, retrieve the text pieces drawn by those instructions, and arrange those pieces correctly into a whole text.
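As a rough idea of what the very first steps look like in the benign case only, here is a deliberately naive Python sketch. The names `naive_extract`, `STREAM_RE`, and `TJ_RE` are hypothetical, and the sample data is synthetic rather than a valid PDF. It assumes unencrypted streams, Flate as the only filter, literal strings shown with the `Tj` operator, and a font whose codes happen to match Latin-1. It skips cross-reference tables, object streams, `TJ` arrays, hex strings, escape sequences, font encodings, and positioning, so it will not work for CJK text.

```python
import re
import zlib

# A stream dictionary followed by its payload. Searching for "endstream"
# instead of honoring /Length is a shortcut that can fail on real files.
STREAM_RE = re.compile(rb"<<(.*?)>>\s*stream\r?\n(.*?)endstream", re.DOTALL)
# Literal strings shown with Tj; ignores escapes and nested parentheses.
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def naive_extract(data: bytes) -> list:
    """Collect literal strings drawn with Tj from all streams in data."""
    texts = []
    for dictionary, payload in STREAM_RE.findall(data):
        if b"/FlateDecode" in dictionary:
            try:
                # decompressobj tolerates the line break before "endstream".
                payload = zlib.decompressobj().decompress(payload)
            except zlib.error:
                continue  # not plain Flate data after all
        for raw in TJ_RE.findall(payload):
            # Assumption: glyph codes coincide with Latin-1 characters.
            texts.append(raw.decode("latin-1"))
    return texts


# Synthetic sample: a single compressed content stream with invented text.
page = b"BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (PDF world) Tj ET"
packed = zlib.compress(page)
sample = (
    b"4 0 obj\n<</Filter/FlateDecode/Length %d>>\nstream\n" % len(packed)
    + packed
    + b"\nendstream\nendobj\n"
)

print(naive_extract(sample))  # ['Hello', 'PDF world']
```

Every item in the list of things this sketch skips corresponds to a sizeable piece of code in a real extraction library.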

Don't expect your solution to be much smaller than pdfminer and similar libraries.

mkl