
I'm looking for a simple way to extract text from Excel/Word/PowerPoint files. The objective is to index the contents in Whoosh for search with Haystack.

There are some packages like xlrd and pandas that work for Excel, but they go way beyond what I need, and I'm not sure they will return just the cell's unformatted text content exactly as it appears in the cell.

Does anybody know of an easy way to do this? My guess is that MS Office files must be XML-based.

Thanks!

A.

misterte

1 Answer


I've done this "by hand" before. As it turns out, .docx, .pptx, and .xlsx files are just zip archives containing .xml files with all of your content. So if you can't find a better tool, you can use zipfile and your favorite XML parser to read the contents.
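For illustration, here is a minimal sketch of that approach using only the standard library (zipfile plus xml.etree.ElementTree). The function name `extract_text`, the helper `local_name`, and the list of part names in `TEXT_PART_PREFIXES` are assumptions made for this example, based on the usual package layout; open one of your own files in an archive manager to confirm where its text lives. It only applies to the zip-based .docx/.pptx/.xlsx formats, not the older binary .doc/.ppt/.xls files, and for .xlsx it only picks up string cells from the shared strings table (numeric values are stored in the individual sheet parts).

```python
import zipfile
import xml.etree.ElementTree as ET

# Parts assumed to hold the main text in each format. This reflects the
# usual package layout; inspect your own files to confirm.
TEXT_PART_PREFIXES = (
    "word/document.xml",     # .docx body
    "ppt/slides/slide",      # .pptx slides: slide1.xml, slide2.xml, ...
    "xl/sharedStrings.xml",  # .xlsx string cells (numbers live in the sheets)
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text(path_or_file):
    """Return the text runs of a .docx/.pptx/.xlsx file as a list of strings."""
    chunks = []
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            # Skip everything except the XML parts listed above.
            if not (name.startswith(TEXT_PART_PREFIXES) and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for element in root.iter():
                # Text sits in <w:t> (Word), <a:t> (PowerPoint) and
                # <t> (Excel shared strings); all have the local name "t".
                if local_name(element.tag) == "t" and element.text:
                    chunks.append(element.text)
    return chunks
```

Something like `" ".join(extract_text("report.docx"))` (with a hypothetical file name) would then give a single string to hand to the indexer. Parts are read in the order the archive lists them, so slide order is not guaranteed.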

Brandon Humpert
  • Hi, thanks for the quick response. This is exactly the kind of answer I was looking for, but could I ask for an example, if you don't mind? – misterte Oct 21 '13 at 17:29
  • Your best bet is probably to just pop open one of your files in your favorite archive manager and go exploring until you get what you want. There are also some SO questions that may give you a good idea of how it should look when you write the code: http://stackoverflow.com/questions/17888352/clear-new-lines-in-docx http://stackoverflow.com/questions/7021141/how-to-retrieve-author-of-a-office-file-in-python – Brandon Humpert Oct 21 '13 at 18:14
  • Hey, followed your advice and the solution was clear pretty soon. Thanks again. – misterte Oct 21 '13 at 18:39