I tried extracting text from .doc files. The text is extracted, but the problem is that the output always comes with these:

��ࡱ�>�� ln characters.

Here is my code:

    doc=open(input_file,'r')
    read_text_file = doc.readline()
    doc_text = ""
    for line in read_text_file:
        doc_text+=str(line)

    return doc_text

Is there a way to remove or re-encode it to utf-8?

Bazinga
  • `.doc` is probably a proprietary Microsoft Word file. You cannot read it like a plain text file. – Feb 10 '14 at 09:39
  • Could you open them in Word and save them as .txt? – tk. Feb 10 '14 at 09:42
  • @tk, I haven't tried that one yet. Is it safe? What if the user doesn't have Word installed? – Bazinga Feb 10 '14 at 09:43
  • What are your requirements? Can you change the input format to docx: https://pypi.python.org/pypi/docx ? – tk. Feb 10 '14 at 09:49
  • @tk The requirement is to be able to extract text from both doc and docx files. I have already finished the docx part. – Bazinga Feb 10 '14 at 09:52
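
To illustrate the point made in the comments: the stray characters in the question are the first bytes of the binary .doc container (an OLE2 compound file) being printed as if they were text. A minimal sketch of telling the two formats apart by their leading bytes follows. The names `sniff_word_format` and `sniff_file` and the returned labels are made up for this example and do not come from the original post or from any library.

```python
# Leading bytes ("magic numbers") of the two container formats.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                          # zip archive, the container .docx uses


def sniff_word_format(header):
    """Guess the container format from the first bytes of a file.

    `header` is a bytes object holding at least the first 8 bytes.
    The function name and the returned labels are illustrative only.
    """
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        # Any zip archive matches here, not only .docx files.
        return "docx-or-other-zip"
    return "unknown"


def sniff_file(path):
    # Binary mode, so the bytes are read exactly as stored on disk.
    with open(path, "rb") as f:
        return sniff_word_format(f.read(8))
```

A file reported as "doc" needs a parser for the binary Word format or a conversion step. Re-encoding its bytes to UTF-8 will not turn them into readable text.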

1 Answer

A docx file is just a zip file (try running unzip on it!) containing a bunch of well-defined XML and collateral files.

import zipfile
from lxml import etree

def get_word_xml(docx_file_name):
    """Return the raw XML of the main document part of a .docx file."""
    with zipfile.ZipFile(docx_file_name) as docx_zip:
        return docx_zip.read('word/document.xml')

# Parse the string containing XML into a usable tree
def get_xml_tree(xml_string):
    return etree.fromstring(xml_string)

# lxml has functions for traversing the XML tree, but I used iter() instead. It
# will traverse every node given a starting node "my_etree" and yield every
# text node together with the text it contains
def _itertext(my_etree):
    """Go through the XML tree and yield (node, text) for each text node."""
    for node in my_etree.iter(tag=etree.Element):
        if _check_element_is(node, 't'):
            yield (node, node.text)

def _check_element_is(element, type_char):
    """Check whether an element is the WordprocessingML tag `type_char`."""
    word_schema = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    return element.tag == '{%s}%s' % (word_schema, type_char)

# word_filename is the path to the .docx file
xml_from_file = get_word_xml(word_filename)
xml_tree = get_xml_tree(xml_from_file)
for node, txt in _itertext(xml_tree):
    print(txt)
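
The loop above prints every text run on its own line. If the text is wanted paragraph by paragraph, the same idea can be sketched with the standard library alone. This is a variant under my own assumptions, not the code from the answer: `docx_paragraphs` is a made-up name, it uses `xml.etree.ElementTree` in place of lxml, and it only looks at `word/document.xml`, so headers, footers, and footnotes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, the same one used in the answer.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return a list with the text of each paragraph in a .docx file.

    `docx_file` may be a path or a binary file-like object.
    Hypothetical helper, written for illustration only.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # w:p elements are paragraphs; the w:t elements below them hold the text.
    for para in root.iter("{%s}p" % W_NS):
        runs = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return paragraphs
```

Calling `"\n".join(docx_paragraphs(path))` then gives the document text with one paragraph per line.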

Zuko