1

I'm currently using a combination of OpenOffice macros and a pdf2text program to extract text, and I would like to find an easier, more efficient way of getting the text out of a PowerPoint file.

I've tried using the Apache POI library without much luck: I encountered numerous exceptions within the library when processing the files I'm looking at, and I don't particularly want to sift through the library's source code.

Is there an easy way to do this without using the aforementioned library?

3 Answers

2

If you have MS Office and you save the PPT as RTF (Rich Text Format), the result contains just the text from the presentation. You could then open the file in any editor that understands RTF files and save it as a text (TXT) file.

I expect this to work from OpenOffice too.

Since you are talking about an API, this may not be the way to go for you, but maybe it will give you new ideas for getting there. Say, you could use multiple macros to do the conversion in stages...

Edit: I got curious and did a short Google search.

This is what I found on one of the www.openoffice.org pages:

As people in this thread have pointed out, retrieving text from an OO document isn't hard since it's just zipped XML that can be parsed with a Perl script. The problem is getting Microsoft PowerPoint documents into a zipped XML format in the first place.

I've found that File -> Wizards -> Document Converter does exactly that. Just tell it you want to convert PowerPoint documents, not templates, point it to your source directory and to where you want it to spit out the result, and you're away.

I then find that

`unzip -p $file.sxi content.xml | perl -p -e "s/<[^>]>/\n/g;s/ +//;s/\n\n/\n/g;" -w`

works rather well for extracting the text.
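For anyone who prefers a script to the shell pipeline, here is a minimal Python sketch of the same idea: read `content.xml` out of the converted archive, replace every tag with a newline, and drop the blank lines. It uses the `*` quantifier that the comments below found necessary. The function name `extract_text_from_sxi` is made up for illustration, and the sketch assumes the archive stores `content.xml` as UTF-8; it has not been checked against real converter output.

```python
import re
import zipfile
from html import unescape


def extract_text_from_sxi(path_or_file):
    """Return the text found in content.xml, one non-empty line per text chunk."""
    # An OpenOffice document is a zip archive; the body lives in content.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        xml = archive.read("content.xml").decode("utf-8")

    # Replace every tag with a newline (note the * after the character class).
    text = re.sub(r"<[^>]*>", "\n", xml)

    # Trim whitespace and drop the empty lines left behind by adjacent tags.
    lines = [line.strip() for line in text.splitlines()]
    joined = "\n".join(line for line in lines if line)

    # Turn entities such as &amp; back into plain characters.
    return unescape(joined)
```

Calling `extract_text_from_sxi("slides.sxi")` (a hypothetical file name) should return the slide text and notes as a single string, in the order they appear in `content.xml`.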

Sorry, I don't have OpenOffice handy to try any of that out.

nik
  • Saving as RTF doesn't work. It seems to save only an index of the slides in the file. – ekkis Dec 17 '12 at 18:40
  • I needed a wildcard in the first substitution: `s/<[^>]*>/\n/g`. Also updated the last substitution to `s/^(\s*\r?\n){2,}/\n/gm`, using the multi-line modifier, and allowing optional carriage return as per answer and comment https://stackoverflow.com/questions/4475042/replacing-multiple-blank-lines-with-one-blank-line-using-regex-search-and-replac#4475642 . – Randall Whitman Apr 12 '19 at 18:00
  • To clarify, the `unzip`+`perl` recipe produces both the slide contents and the presenter notes (which I required), rather than only the slide text (which can be gotten in GUI LibreOffice: View - Outline). – Randall Whitman Apr 12 '19 at 18:08
1

pptx files are relatively easy to deal with because they are just zipped XML: you can unzip them and then strip all the XML tags from the files in the 'ppt/slides' subdirectory of the unzipped archive, which yields most of the pertinent text.
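As a rough illustration of the unzip-and-strip approach, the sketch below uses only the Python standard library. Instead of stripping tags with a regex, it parses each `ppt/slides/slideN.xml` file and collects the `<a:t>` text runs, grouped by `<a:p>` paragraph. The function name `extract_pptx_text` is hypothetical, and slides are ordered by the number in the file name, which usually but not necessarily matches the presentation order. Presenter notes live elsewhere in the archive and are not covered here.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace: slide text is stored in <a:t> runs inside <a:p> paragraphs.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(path_or_file):
    """Return a list with one string per slide, ordered by slide file number."""
    with zipfile.ZipFile(path_or_file) as archive:
        # Collect (number, name) pairs so slide10 sorts after slide2.
        slides = []
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                slides.append((int(match.group(1)), name))

        result = []
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                # Join the runs of one paragraph so words are not split apart.
                runs = [t.text or "" for t in para.iter(A_NS + "t")]
                line = "".join(runs).strip()
                if line:
                    paragraphs.append(line)
            result.append("\n".join(paragraphs))
    return result
```

Something like `print("\n\n".join(extract_pptx_text("deck.pptx")))` (with `deck.pptx` standing in for your own file) would dump the text of every slide.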

ppt files are a whole other ballgame, and the process is rendered even more painful because the canonical tool, catppt from the catdoc package, is susceptible to a buffer overflow that makes it nearly useless (it segfaults on a large percentage of ppt files).

pokute
0

In LibreOffice 5, File - Export - HTML produces output that includes both the slide contents and the presenter notes. Then open the .html file in Firefox or another browser and use File - Save Page As - Text File (or use a utility such as `pandoc -o file.txt file.html`).
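If neither a browser nor pandoc is at hand, the HTML-to-text step can also be approximated in Python with the standard `html.parser` module. This is only a sketch: the names `TextExtractor` and `html_to_text` are made up, it has not been tried on actual LibreOffice export output, and it puts every text chunk on its own line, so inline formatting such as bold words will break a sentence across lines.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Reading the exported file and passing its contents to `html_to_text` should give roughly what the browser's "Save Page As - Text File" produces, minus the layout.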

Randall Whitman