Getting files of trees from PDF (preferably using Python)

Question

I would like to make a series of files containing the trees in this PDF (http://mica.lif.univ-mrs.fr/d6.clean2-backup.pdf). The names of the files would be the corresponding tree numbers on the left (t0, t1, etc).

I have tried to use Python to extract the relevant information and trees, but I'm having trouble. Specifically, when I tried extracting the trees as images (using https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html), none of the trees showed up (presumably because the trees are not embedded as JPEG images). When I tried extracting everything as text instead (as in https://www.geeksforgeeks.org/working-with-pdf-files-in-python/), the trees lost all their formatting (and some of their information, I think). How could I go about getting the files I want from this PDF? Could it be done in Python? Is there another way that's easier?

Alternatively, the website (http://mica.lif.univ-mrs.fr/) from which I obtained the PDF has the trees in another form (ex: t27 S##1#l# NP#0#2#l#s NP#0#2#r#s VP##3#l# V##4#l#h V##4#r#h NP#1#5#l#s NP#1#5#r#s VP##3#r# S##1#r#). Is there a good way to convert this form into a good visual in the form of trees?

Any help in either of these approaches (or others if people have ideas) would be much appreciated. Thanks!

perhaps you may want to show some of your current code so that someone can better help you find a solution. — Azeame, Sep 20 '18 at 18:07

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

If you look at the metadata of the PDF file, you can see that it was created with TeX (LaTeX). I'd suggest getting the original LaTeX source file (instead of the PDF) from whoever created this document rather than trying to OCR the diagrams in the PDF.

Basically, going from this LaTeX-generated PDF back to an editable source document isn't really possible (without a lot of work) because of the way PDFs are created. You can think of turning a PDF back into a document as something like reverse engineering a piece of software, as another Stack Overflow member explains in a thread about going from a PDF back to a LaTeX document: https://stackoverflow.com/a/1620020/10382707

Sometimes if I'm trying to do some simple optical character recognition (OCR) on PDFs I try uploading them to Google Docs to see how their OCR engine works at extracting text from PDF documents. GDocs OCR works well for PDFs that are formatted in a standard way, but it tends to break on things like tables, charts, etc.

If you're interested in turning pictures of math equations into LaTeX, you might want to check out this neat tool that some researchers at Harvard created as part of OpenAI's Call for Research. It'll turn an image of a math equation into LaTeX notation.
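
As for the bracketed form quoted in the question (t27 S##1#l# ...), that is probably easier to work with than the PDF. Below is a minimal sketch of one way to read it, not a definitive parser. It assumes each token has five `#`-separated fields in the order label, argument index, node id, side, node type, and that side `l` opens a node while the matching `r` closes it. I inferred that layout from the single example in the question, so please check it against the site's documentation. The names `Node`, `parse_tree` and `render` are made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def parse_tree(line):
    """Parse a line like 't27 S##1#l# ... S##1#r#' into (tree name, root Node)."""
    name, *tokens = line.split()
    root = None
    stack = []  # (node id, Node) pairs for nodes opened but not yet closed
    for token in tokens:
        # ASSUMED field order: label, argument index, node id, side, node type.
        label, _arg, node_id, side, _kind = token.split("#")
        if side == "l":
            # An 'l' token opens a node; attach it to the innermost open node.
            node = Node(label)
            if stack:
                stack[-1][1].children.append(node)
            else:
                root = node
            stack.append((node_id, node))
        elif side == "r":
            # An 'r' token must close the most recently opened node.
            if not stack or stack[-1][0] != node_id:
                raise ValueError(f"unbalanced token: {token}")
            stack.pop()
        else:
            raise ValueError(f"unexpected side in token: {token}")
    if stack:
        raise ValueError("some nodes were never closed")
    return name, root


def render(node, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


example = ("t27 S##1#l# NP#0#2#l#s NP#0#2#r#s VP##3#l# V##4#l#h V##4#r#h "
           "NP#1#5#l#s NP#1#5#r#s VP##3#r# S##1#r#")
name, root = parse_tree(example)
print(name)
print("\n".join(render(root)))
```

To get one file per tree, you could write `"\n".join(render(root))` to a file named after `name` (t0, t1, etc.). The same `Node` structure could also be fed to a drawing tool if you want graphical trees rather than indented text.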
