I am using PyMuPDF and ran into a problem getting spacing information about the text in a PDF file. I asked in the library's Discord channel whether it is possible to obtain this spacing information, but I was told that the library does not support it. Perhaps there are other libraries that can do this?

I looked at other libraries but did not find anything suitable. Maybe I missed something.

  • Please provide enough code so others can better understand or reproduce the problem. – Community Jan 27 '23 at 08:29
  • By *character spacing and word spacing* do you mean the values of the PDF text state parameters of that name, or do you mean the actual distances between characters and words? – mkl Jan 27 '23 at 10:39
  • @mkl I'm sorry, I meant these parameters: line spacing, paragraph spacing, character spacing – user377394 Jan 27 '23 at 11:13
  • As per PyMuPDF: Before anyone gets a wrong impression: You **_can_** extract text with all desired metadata detail: text position (bbox), font properties, writing direction, etc. All this down to **_each single character_**. **_And all this works for PDF, XPS, EPUB_** and a handful more document types. **Therefore** PDF-specific constructs like word and character spacing are not returned. – Jorj McKie Jan 27 '23 at 13:26
  • @user377394 - **_Line spacing_** is available in PyMuPDF, because it is a font property, which can be extracted in PyMuPDF. Also, the inter-line distance can easily be computed from the line bounding boxes. **_Paragraph spacing_** is not even a PDF concept. But paragraph bounding boxes are available in PyMuPDF. – Jorj McKie Jan 27 '23 at 13:34

2 Answers

Disclaimer: I am the author of borb, the library used in this answer.

Usually, the information you're looking for is hidden behind layers of abstraction. A PDF library might typically allow you to extract text (and it uses information about word and character spacing to do so), but it does not make this information available to the outside world.

You can use borb to get access to this (low level) information. The key concept here is EventListener. This is an interface. Classes implementing this interface get notified whenever a rendering event has finished.

Rendering events may include:

  • text being rendered
  • images being rendered
  • switching to a new page and so on

borb has a class that extracts text, so I would recommend you check out its code. Looking at line 62 of that class, we see that any event that is "render a piece of text" gets redirected to its own separate method.

The method _render_text stores the TextRenderInfo objects until a page has finished rendering (at which point it will use the TextRenderInfo objects to determine the text that was on the page).

You can see the "end of page" logic in action on line 87.

Here you see that TextRenderInfo has all kinds of attributes related to position. You can use get_baseline to access it.
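To make the listener idea above more concrete, here is a minimal sketch of the pattern in plain Python. It is not borb's actual API: `TextEvent`, `EndPageEvent` and `SpacingListener` are hypothetical names invented for illustration, and a real library would supply its own event classes. The listener stores text events until the page ends, then measures the horizontal gaps between chunks that share a baseline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextEvent:
    """Hypothetical event: one chunk of text drawn on a baseline."""
    text: str
    x: float      # left end of the baseline
    y: float      # baseline height (PDF convention: y grows upward)
    width: float  # advance width of the chunk


@dataclass
class EndPageEvent:
    """Hypothetical event: the current page has finished rendering."""
    page_nr: int


class SpacingListener:
    """Buffers text events per page and reports gaps between chunks."""

    def __init__(self) -> None:
        self._pending: List[TextEvent] = []
        self.gaps_per_page: Dict[int, List[float]] = {}

    def event_occurred(self, event) -> None:
        # Redirect each kind of event to its own handling branch.
        if isinstance(event, TextEvent):
            self._pending.append(event)
        elif isinstance(event, EndPageEvent):
            self._end_page(event.page_nr)

    def _end_page(self, page_nr: int) -> None:
        # Order chunks top to bottom, then left to right.
        chunks = sorted(self._pending, key=lambda e: (-e.y, e.x))
        gaps = []
        for prev, cur in zip(chunks, chunks[1:]):
            if prev.y == cur.y:  # only compare chunks on the same baseline
                gaps.append(round(cur.x - (prev.x + prev.width), 3))
        self.gaps_per_page[page_nr] = gaps
        self._pending = []  # start fresh for the next page


if __name__ == "__main__":
    listener = SpacingListener()
    events = [
        TextEvent("Hello", 10.0, 700.0, 30.0),
        TextEvent("world", 43.0, 700.0, 32.0),
        EndPageEvent(0),
    ]
    for ev in events:
        listener.event_occurred(ev)
    print(listener.gaps_per_page)
```

Comparing baselines with `==` is a simplification. Real documents would probably need a small tolerance.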

Joris Schellekens

I solved my problem with pdfminer.six and PyMuPDF by getting the line and character positions. Thanks, all of you.
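For anyone landing here later, this is a rough sketch of the PyMuPDF half of that approach, not the exact code used. It assumes the `"rawdict"` output of `page.get_text`, where each text block holds lines, each line holds spans, and each span holds `chars` with a `bbox`. The helper names `char_gaps`, `line_distances` and `spacing_report` are made up for this example. The values are measured distances on the page, not the PDF character or word spacing parameters.

```python
from typing import Dict, List


def char_gaps(line: Dict) -> List[float]:
    """Horizontal gaps between consecutive characters of one rawdict line."""
    chars = [ch for span in line["spans"] for ch in span["chars"]]
    return [
        round(nxt["bbox"][0] - cur["bbox"][2], 3)  # next x0 minus current x1
        for cur, nxt in zip(chars, chars[1:])
    ]


def line_distances(block: Dict) -> List[float]:
    """Vertical distance between the tops of consecutive lines in a block."""
    tops = [line["bbox"][1] for line in block["lines"]]
    return [round(b - a, 3) for a, b in zip(tops, tops[1:])]


def spacing_report(path: str, page_number: int = 0) -> List[Dict]:
    """Collect line distances and character gaps for every text block."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(path) as doc:
        raw = doc[page_number].get_text("rawdict")

    report = []
    for block in raw["blocks"]:
        if block["type"] != 0:  # type 0 is text; skip image blocks
            continue
        report.append({
            "line_distances": line_distances(block),
            "char_gaps": [char_gaps(line) for line in block["lines"]],
        })
    return report
```

Calling `spacing_report("some.pdf")` (the file name is a placeholder) returns one entry per text block on the first page. Negative gaps can show up where glyph boxes overlap, for example with kerning.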