
I am currently developing a proprietary PDF parser that can read multiple types of documents containing various types of data. Before starting, I was wondering whether reading PowerPoint slides is possible. My employer's presentation guidelines require imagery and background designs. Is it possible to build a parser that can read the data from these PowerPoint PDFs without the slide decor getting in the way?

So the workflow would basically be this:

  1. At the end of a project, the project report is delivered in the form of a presentation.
  2. The presentation would be converted to PDF.
  3. The PDF would be submitted to my application.
  4. The application would read the slides and create a data-focused report for quick review.

The goal of the application is to significantly cut down on the amount of reading required, as some of these presentation reports run to many pages and there is not enough time in the day to read them all.

Haroldo_OK
  • MS Office PDF exports can produce tagged PDFs. Exporting with tag information might improve your results. – mkl Jul 11 '19 at 18:01

2 Answers


Parsing PDFs into structured data is always tricky, as the format is geared towards precise printing rather than ease of editing or data extraction.

A PDF contains information along the lines of "there's a label with such-and-such text at position (x, y) on a certain page".

You will very likely need some heuristics to turn that into structured data.

In effect, it will be a form of scraping.
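To give an idea of what such a heuristic might look like, here is a minimal sketch. It assumes your PDF library already hands you positioned text fragments; the `TextItem` shape, the function names and the thresholds are all hypothetical, not part of any particular library. The idea is that text repeated at (nearly) the same position on most pages is probably slide decor such as a footer, so it can be dropped before sorting the rest into reading order.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:
    """One positioned text fragment (hypothetical shape, not a real library type)."""
    page: int
    x: float
    y: float  # assumed here: distance from the top of the page, in points
    text: str


def drop_repeated_decor(items, min_page_fraction=0.8, grid=5.0):
    """Drop text that sits at the same snapped position on most pages."""
    pages = {item.page for item in items}
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return list(items)

    def key(item):
        # Snap coordinates to a coarse grid so small layout jitter still matches.
        return (round(item.x / grid), round(item.y / grid), item.text.strip())

    # For each (position, text) key, count the distinct pages it appears on.
    pages_per_key = Counter(k for k, _page in {(key(i), i.page) for i in items})

    threshold = min_page_fraction * len(pages)
    return [i for i in items if pages_per_key[key(i)] < threshold]


def reading_order(items):
    """Sort by page, then top-to-bottom, then left-to-right."""
    return sorted(items, key=lambda i: (i.page, i.y, i.x))


sample = [
    TextItem(1, 40.0, 20.0, "ACME Confidential"),
    TextItem(1, 60.0, 200.0, "Revenue: 1.2M"),
    TextItem(1, 60.0, 150.0, "Q2 Results"),
    TextItem(2, 40.0, 20.0, "ACME Confidential"),
    TextItem(2, 60.0, 150.0, "Risks"),
    TextItem(3, 40.4, 20.2, "ACME Confidential"),
    TextItem(3, 60.0, 150.0, "Next steps"),
]
kept = reading_order(drop_repeated_decor(sample))
print([i.text for i in kept])
# → ['Q2 Results', 'Revenue: 1.2M', 'Risks', 'Next steps']
```

This only catches decor that is real text and identical from page to page. Things like slide numbers (which change per page) or background images would need their own rules.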

Searching for "PDF scraping" (or something similar) on your favorite search engine would be a good start.

Also, you may want to look at these similar posts:

Haroldo_OK

A PowerPoint PDF isn't a type of PDF.

There isn't going to be anything natively in the PDF that identifies elements on the page as 'slide' graphics that originated from a PowerPoint file, for example.

You could try building an algorithm that decides which content to drop from the generated PDF, but that would be tricky and seems like the wrong approach to me.

A better approach would be to export the PPT to text first. For example, in Microsoft PowerPoint, export it to an RTF file so you get all of the text out, then either use that directly or convert it to PDF.
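If you would rather automate the "get the text out first" step, one option (a different technique from the RTF export, named here only as an alternative) is to read the `.pptx` file directly. A `.pptx` is a ZIP archive, and the slide text lives in `<a:t>` elements inside `ppt/slides/slideN.xml`. The sketch below uses only the Python standard library; the function name `slide_texts` is my own, and it assumes the modern `.pptx` format rather than the older binary `.ppt`.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for paragraphs (<a:p>) and text runs (<a:t>).
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def slide_texts(pptx_file):
    """Return {slide file number: [paragraph text, ...]} for a .pptx path or file object.

    Note: the number comes from the slide's file name, which is assumed here
    to match the presentation order; that is not guaranteed in every deck.
    """
    result = {}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if not match:
                continue  # skip masters, layouts, media and other parts
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            # Each <a:p> is a paragraph; join its <a:t> runs into one string.
            for para in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text.strip():
                    paragraphs.append(text)
            result[int(match.group(1))] = paragraphs
    return dict(sorted(result.items()))
```

Usage would be something like `slide_texts("report.pptx")`, where the file name is just a placeholder. Because only the slide parts are read, background images are skipped, and text that lives purely on the slide master or layout should mostly be skipped too, though footers instantiated on individual slides may still show up.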

JosephA
  • I understand that, I was just calling it that for simplicity. Thank you for the idea. I'll try converting the presentation file first. – Mashiyath Haque Jul 15 '19 at 03:25