Can an unstructured PDF be tagged using any tools or libraries? The only options I have found for tagging a PDF are Adobe Acrobat and the Auto-Tag APIs (not something I am looking forward to, and the results are not so great imo).

I know the bounding boxes and semantics of the elements (i.e. paragraphs, lists, headings, tables).

So, is there a way to manipulate PDF trees/objects, preferably in Python or JavaScript?

Any thoughts on the topic are appreciated!

The PDF spec talks about "StructTreeRoot" for Tagged PDFs. Building these objects by hand at a low level would be nerve-racking, so is there any high-level library to manipulate them?
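As a rough illustration of what those objects involve, here is a minimal sketch that models a structure tree as plain Python dictionaries, built from the kind of data described above (page, tag, bounding box, already in reading order). It does not write a PDF. The function names, the `_bbox` key and the use of a page index in place of an indirect page reference are assumptions for illustration only. A real tagged file also needs entries such as /MarkInfo, /ParentTree and per-page /StructParents, written with a low-level PDF library.

```python
def build_struct_tree(elements):
    """Model a structure tree as nested dicts (illustrative, not a PDF writer).

    Each element is a dict with 'page' (0-based index), 'tag' (e.g. 'P', 'H1')
    and 'bbox'. The list is assumed to already be in reading order.
    """
    root = {"Type": "/StructTreeRoot", "K": []}
    document = {"Type": "/StructElem", "S": "/Document", "K": []}
    root["K"].append(document)

    next_mcid = {}  # marked-content IDs are numbered per page
    for el in elements:
        page = el["page"]
        mcid = next_mcid.get(page, 0)
        next_mcid[page] = mcid + 1
        document["K"].append({
            "Type": "/StructElem",
            "S": "/" + el["tag"],   # structure type, e.g. /P or /H1
            "Pg": page,             # stand-in for an indirect page reference
            "K": mcid,              # links to the marked content on that page
            "_bbox": el["bbox"],    # hypothetical helper key, not a PDF key
        })
    return root


def marked_content_wrapper(tag, mcid):
    """Return the operators that would wrap the matching content-stream run."""
    return f"/{tag} <</MCID {mcid}>> BDC", "EMC"
```

The bounding box would be used to decide which text-drawing operators in the page content stream fall inside each element, so that they can be wrapped in the BDC/EMC pair carrying the same MCID. That matching step is the hard part and is not shown here.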

  • Such a thing could never be generic, and would be very difficult at best even for documents that are nearly the same. It is best to tag it when generating it, not after the fact. – Kevin Brown Feb 17 '23 at 16:55
  • Agreed with you on tagging beforehand @KevinBrown, but I probably do have all the information needed to tag a PDF object according to the PDF 1.7 spec; I just need to manipulate some objects, don't you think? – Harsh Donga Feb 19 '23 at 00:56
  • @KJ Sorry, couldn't get you! I have the information for the elements in the PDF (verified by a human). – Harsh Donga Feb 19 '23 at 01:02
  • @KJ I see! Seems way more complicated than it looks in theory. However, fundamentally I should be able to do it, and your comment gives good insight! So, to summarize, there is no way I can use data (bbox, tag, fonts) to tag an existing PDF in place (even going to the extent of hard-coding all the human verification checks)? – Harsh Donga Feb 19 '23 at 10:54
  • I must disagree with those that just advise giving up the effort because it can't be generalized for all cases, especially when you have stated that you know enough about the content to have some reliable heuristics. Of course we need tools to do that! Not unthinkable to have hundreds of PDFs with a known content structure, but no tags. Why on earth would anyone want to do that manually? You can have a look at PDFKit, which has features for tagging, but I don't think it can open/edit existing files - only build new ones from scratch. – brennanyoung Feb 27 '23 at 14:03
  • Definitely would take advice from veterans in PDF! @brennanyoung I do believe that it should be possible! I don't expect PDF/UA or anything like that, I am just looking to add a tag tree (StructTreeRoot) for the content objects. If someone could guide me a bit, that would be so helpful! Thanks for the comment :) – Harsh Donga Feb 28 '23 at 11:17
  • @brennanyoung would I be able to re-create with exact formatting!? – Harsh Donga Feb 28 '23 at 11:20
  • I don't know. Most of the semantics get lost if tags aren't written at the original authoring time, so you'll be working purely with heuristic analysis (I assume this is what Acrobat auto-tagging does). However, if you have some familiarity with the structure of the files, it could be 'informed guesswork'. I understand that LibreOffice can open PDF in some kind of editable format and has a scripting interface. This might be worth a look too. There are a lot of "news items" on pdfa.org about efforts to offer round-trip editing of PDF (via Word, HTML or other steps). – brennanyoung Mar 01 '23 at 12:44

2 Answers

At this time there is a good overview at https://commonlook.com/auto-tagging-pdfs/

Conclusion
Automated tagging solutions can be helpful to get the process started, but, in the end, none of them are perfect, some are downright lousy, and you’re most likely going to have to at least manually verify some stuff and probably have to fix a lot, too.

Tagging a PDF, with all that entails, needs to be done by the PDF writer. So here is this page as tagged by MS Edge; you can also use Chromium/Foxit/Skia (e.g. Chrome or Chromium Portable).

"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe" --headless --print-to-pdf=C:\data\output.pdf --virtual-time-budget=1000 https://stackoverflow.com/questions/75483409/can-i-tag-a-pdf-programmatically/75500169
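Since the question asks for Python, the command above can be driven from a script. This is a minimal sketch, assuming a Windows machine with Edge at the same path. The function names are made up for illustration, and only the flags already shown in the command are used.

```python
import subprocess
from pathlib import Path

# Path copied from the command above; adjust it for your installation.
EDGE = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"


def build_print_command(url, output_pdf, browser=EDGE, budget_ms=1000):
    """Return the argument list for a headless print-to-PDF run."""
    return [
        browser,
        "--headless",
        f"--print-to-pdf={output_pdf}",
        f"--virtual-time-budget={budget_ms}",
        url,
    ]


def print_to_pdf(url, output_pdf, browser=EDGE):
    """Run the browser and report whether the output file was written."""
    subprocess.run(build_print_command(url, output_pdf, browser), check=True)
    return Path(output_pdf).exists()
```

Calling `print_to_pdf(url, r"C:\data\output.pdf")` should leave the printed file at that path. Whether the result is tagged depends on the browser's PDF writer, not on the script.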

Consider how impossible this may be to do retrospectively, word by word or even a sentence or paragraph at a time, as PDF does not inherently have such constructs. Things like H1 are discarded by the print output generator as superfluous bloat that a printer does not require.

OK, the prime reason for tagging is the challenged human reader, so with a tagged PDF let's see how it fares. Here we are only dealing with one simple page without images or tables (the two most common reasons for checking tags).

So, programmatically, how will an iterative application driven by Python resolve the residual requirements that are missing?

Language: as a human I know the language is English (that should have been obvious to a browser that speaks aloud).

Title: it is missing, but again that should be obvious. Is "TAGGING PDFS" suitable as a working title for approval by another person? Let's temporarily ignore the major errors, namely that the tagging and the tab order are wrong. A human with eyes and a brain to analyse why can fix those as they progress through the human aspects of every page. Asking "can the human read and navigate this logically?" will itself resolve the tag order and, at the same time, check whether the page has suitable contrast for the visually challenged.

So the tagging of a PDF is best done at the time a human completes their retrospective use of the page, and that is best done using "Pre-flight" or "Post-flight" GUI applications, such as Acrobat.

K J
  • I appreciate your help a lot! However, I would happily skip the things that require total manual effort (colour contrast). I am not looking to conform to PDF/UA or any other specification; I just wish to have the content tagged (marked). Also, I have the info for reading order and even the language (using in-house and 3rd party models for that). With this info, does your answer change? – Harsh Donga Feb 28 '23 at 11:26
  • Check out the link below: they could export the file as a tagged PDF. I just want to know how this would work; I am totally fine with skipping the manual task and keeping all elements as paragraphs. https://youtu.be/j3nKzpFH69g?t=4724 – Harsh Donga Apr 04 '23 at 10:56
  • I highly appreciate your effort in going through the link! *How it works is they use a robot to disassemble source document then start afresh by building the way a human would order it for another human (per the ADA requirements).* This is my real question, and I don't think this needs AI! Sorry to take so much time to absorb the valuable information you have been helping out with every time! – Harsh Donga Apr 06 '23 at 11:07
  • I believe I have failed to get the question across clearly! If possible we can connect on some other platform, as you are the only one showing valuable guidance! I will update the post with the findings after discussion. – Harsh Donga Apr 07 '23 at 12:06
  • Kind sir, at this point I am TOTALLY fine with however wrongly the tags end up marked on a given paragraph, I just need to know HOW I would do that programmatically! I want to know if it's possible to control (in code) the marking of a PDF element. Without using random software, can I tag an element or not is the only question! – Harsh Donga Apr 08 '23 at 16:01
  • Would delete it later, but I am appreciative of your effort to come back to this post and try to help a novice out! – Harsh Donga Apr 08 '23 at 16:03

There is a free service offering some of the major PDFix features. Its autotagging is based on their internal algorithm, which is customizable.

https://pdfix.io/add-tags-to-pdf/

It can be used from various languages or from the CLI.

For Python users here's an example of utilizing the AI object detection model for autotagging PDF content.

https://github.com/pdfix/pdfix-autotag-deepdoctection

Sam Piston