
I have a PDF with multiple misaligned text blocks. I am trying to generate a new PDF in which the text is aligned according to a known transformation matrix. I can use PyMuPDF (fitz) to extract the text from the source PDF and insert it into the target PDF, but this way I lose all the structural information (blocks, lines, spans, etc.):

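import fitz  # PyMuPDF

src_doc = fitz.open('my.pdf')
tgt_doc = fitz.open()  # new, empty PDF

src_page = src_doc[0]
# A new document has no pages yet, so create one matching the source page size
tgt_page = tgt_doc.new_page(width=src_page.rect.width, height=src_page.rect.height)

text_dict = src_page.get_text('dict')
transform = fitz.Matrix(1, 1)  # would be non-identity in practice
tw = fitz.TextWriter(tgt_page.rect)

blocks = []  # text blocks found on the source page
for block in text_dict['blocks']:
    if block['type'] != 1:  # ignore images (type 1); text blocks are type 0
        blocks.append(block)
        for line in block['lines']:
            for span in line['spans']:
                tw.append(span['origin'], span['text'])

# morph = (fixed point, matrix): the matrix is applied around the fixed point
tw.write_text(tgt_page, morph=(fitz.Point(0.0, 0.0), transform))

tgt_doc.save('aligned.pdf')
src_doc.close()
tgt_doc.close()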

This aligns the text, but it loses all information about the text structure: tgt_page ends up with more blocks than src_page.
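To make the mismatch concrete, here is a minimal sketch of how the structure of two pages could be compared. It assumes the input has the same shape as the dictionary returned by `get_text('dict')` (a `blocks` list whose text blocks have `type` 0 and contain `lines`, which contain `spans`). `structure_counts` is a hypothetical helper name, not part of PyMuPDF, and `sample` is hand-made stand-in data rather than output from a real PDF.

```python
def structure_counts(text_dict):
    """Return (n_blocks, n_lines, n_spans) for the text blocks of a page dict."""
    # Text blocks have type 0; image blocks (type 1) are skipped.
    text_blocks = [b for b in text_dict.get("blocks", []) if b.get("type") == 0]
    n_lines = sum(len(b.get("lines", [])) for b in text_blocks)
    n_spans = sum(
        len(line.get("spans", []))
        for b in text_blocks
        for line in b.get("lines", [])
    )
    return len(text_blocks), n_lines, n_spans


# Hand-made stand-in for a page dict (assumption: mirrors the 'dict' layout).
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Hello"}, {"text": "world"}]},
            {"spans": [{"text": "second line"}]},
        ]},
        {"type": 1},  # image block, ignored
        {"type": 0, "lines": [
            {"spans": [{"text": "another block"}]},
        ]},
    ]
}

print(structure_counts(sample))
```

With real pages, the same helper could be called on `src_page.get_text('dict')` and on the dictionary extracted from the saved target page; differing tuples would confirm that the structure changed.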

Can I do the same without compromising the page structure?

I was originally using pikepdf, as ocrmypdf does, but unfortunately pikepdf only supports ASCII characters, and I am having trouble using it for non-Latin text. Any other library that does the job is also fine.

asymptote
  • So, I'm a little confused reading your post (I've read it multiple times)... if your script correctly aligns the text, and it **_looks_** correct, isn't that what matters? What happens when you run your script on the corrected PDF? I'd imagine all those block/line/span-level details are there... how could it look right without that information? – Zach Young Jun 08 '22 at 05:42
  • As a quick test, I copied your script, cleaned up some issues, and ran it with a Matrix that pre-rotates by 5 degrees. I then re-ran the rotated PDF through it, and it did what I expected, rotating the already rotated text even more. So the script can still read all the blocks/lines/spans. – Zach Young Jun 08 '22 at 06:04
  • @ZachYoung the new documents will have different blocks/lines as compared to the old one depending on the format. You can confirm this by `assert len(src_blocks) == len(tgt_blocks)` and so on. – asymptote Jun 08 '22 at 07:37

0 Answers