I am trying to read in an old text document file using Python.

The file was written in 1995 and has the file extension ".WPF".

I tried:

f = open('/Users/zachary/Downloads/2R.WPF', mode = 'r')  
print(f.read())

If I open it in LibreOffice, it displays correctly.

How can I process the text in a .WPF file using Python?

Link: WTO Dispute Settlement DS2 Panel Report

Someone marked this as a duplicate on the grounds that the file is just a Word document wrongly named .WPF. However, it does not appear to be a .doc file, since textract.process returns the error "it's not .doc".

snapper
  • 2
    Is it a [Counter-Strike PODBot Waypoint File](https://fileinfo.com/extension/pwf)? – Laurent LAPORTE Feb 18 '18 at 13:13
  • @LaurentLAPORTE Thx. No, actually that was my typo. It's WPF, not PWF. Sorry again – snapper Feb 18 '18 at 13:18
  • 2
    So, is it a [Windows Presentation Foundation](https://learn.microsoft.com/en-us/dotnet/framework/wpf/), could you explain what you have, please? What do you want to achieve? – Laurent LAPORTE Feb 18 '18 at 13:20
  • @LaurentLAPORTE I attached the link address. My ultimate goal is to read the file in as a string in Python, then parse the Table of Contents of each WPF file and store it as a dictionary with the page number as the value, such as {'introduction': 1, 'factual aspect': 4, 'main argument': 9, ...} – snapper Feb 18 '18 at 13:33
  • 2
    You cannot bluntly open and read any 3rd party binary file type and expect it to return plain readable text. Use a search engine to find a library to do so, or (advanced) look up its specifications and write it yourself. – Jongware Feb 18 '18 at 13:40
  • 1
    The file is actually a normal Microsoft Word `.doc` file with a wrong name. – Lukas Körfer Feb 18 '18 at 13:44
  • Possible duplicate of [Reading/Writing MS Word files in Python](https://stackoverflow.com/questions/188444/reading-writing-ms-word-files-in-python) – Lukas Körfer Feb 18 '18 at 13:45
  • @lu.koerfer How can I check that? If you show me how to do it, I will not forget it – snapper Feb 18 '18 at 13:57
  • @lu.koerfer I changed the file extension to .doc and tried textract, but it returns "it's not a doc file" – snapper Feb 18 '18 at 13:59
  • 1
    My guess is that the `.wpf` file is a Word Perfect form, the form variant of a `.wpd` Word Perfect document. I don't know of any Python library that will read those formats. Reading it into LibreOffice or Word and writing it out in a format that Python can read is probably your best bet, unless you're willing to research the Word Perfect formats and write your own reader library. – ottomeister Feb 19 '18 at 01:47
  • @ottomeister Writing my own reader... quite tough, but it sounds like a good challenge – snapper Feb 19 '18 at 02:20
  • It *is* a WordPerfect file, per the magic header `FF,"WPC"`. I can't determine the exact version, but it's v.5 or later. I recognize *some* of the binary structures but it must be 20 yrs since I last saw one of these, and all my own handling software has long gone by now. – Jongware Feb 19 '18 at 12:56
  • @usr2564301 So what's your recommendation for processing this data? – snapper Feb 19 '18 at 14:24

1 Answer

As can be determined from the very first bytes, that file is a WordPerfect 5.x file (where x is 0, 1, or possibly 2), a file format dating back to around 1989.
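A minimal sketch of that check, for anyone who wants to test a file before picking a parser. The function name `has_wp_signature` is made up for this illustration; the signature itself (the byte 0xFF followed by the letters "WPC") is the one the program further down tests for.

```python
def has_wp_signature(data):
    """Return True if the buffer starts with 0xFF followed by 'WPC'."""
    return bytes(data[:4]) == b"\xffWPC"


# Typical use (the path is a placeholder):
#     with open("2R.WPF", "rb") as handle:
#         print(has_wp_signature(handle.read(4)))

# For comparison: legacy Word .doc files are OLE2 compound files, which
# start with the bytes D0 CF 11 E0, so they fail this check.
print(has_wp_signature(b"\xffWPC\x00\x00"))
print(has_wp_signature(b"\xd0\xcf\x11\xe0"))
```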

According to its description, the Tika interface for Python should be able to convert this for you, but as far as word processor formats go, these older WordPerfect files are fairly easy to decode, without anything more than a plain Python installation.

The format consists of a large header (which, among other information, defines the printer that the document was formatted for, the list of fonts used, and some basic "style" information – I chose to skip it entirely in my program below), followed by plain text which is interspersed with binary codes which govern the formatting.
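Skipping the header comes down to one number. As a sketch, assuming (as the program below does) that bytes 4 to 7 of the file hold a little-endian 32-bit offset to the start of the document text; `body_offset` is a name invented for this example:

```python
def body_offset(data):
    """Read the 32-bit little-endian offset stored in bytes 4..7."""
    return int.from_bytes(data[4:8], "little")


# Synthetic buffer: signature, an offset of 16, 8 filler bytes, then text.
sample = b"\xffWPC" + (16).to_bytes(4, "little") + bytes(8) + b"Hello"
start = body_offset(sample)
print(start, sample[start:])
```

Everything before that offset is header data and can be ignored if only the text is needed.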

The binary codes appear in 3 distinct forms:

  1. single-byte: 0x0A is a Return, 0xA9 is a breaking hyphen, 0xAA is a breaking hyphen when the line is broken at that position, and so on.
  2. multi-byte, fixed length: the byte is followed by one or more specifications. For example, 0xC0 is a "special character code". It is followed by the character set index and the index of the actual character inside that set. The final byte of a fixed-length code is always the starting byte again.
  3. multi-byte, variable length: the first byte determines a main category of formatting and is followed by a second byte to indicate a subcategory; after that, 2 bytes in little-endian indicate the length of the following data (excluding the first 4 bytes). This code always ends with the same items in reversed order: 2 bytes (little-endian) for the length, the subcategory, then the main category.
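The third form is the one that needs the most care, so here is a rough sketch of reading one such code. It assumes the length field counts everything after the first 4 bytes, including the 4 trailing bytes, which is how the program below steps over these codes. `read_variable_code` is a hypothetical helper, and checking the whole trailer is stricter than the program below, which only compares the final byte.

```python
def read_variable_code(data, pos):
    """Return (code, subcode, payload, next_pos), or None if the bytes at
    pos do not form a complete variable-length code."""
    if pos + 4 > len(data):
        return None
    code, sub = data[pos], data[pos + 1]
    length = int.from_bytes(data[pos + 2:pos + 4], "little")
    end = pos + 4 + length
    # The length must at least cover the 4 trailing bytes.
    if length < 4 or end > len(data):
        return None
    # Trailer: the same length bytes again, then subcode, then main code.
    expected = bytes(data[pos + 2:pos + 4]) + bytes([sub, code])
    if bytes(data[end - 4:end]) != expected:
        return None
    return code, sub, bytes(data[pos + 4:end - 4]), end


# Synthetic example: code 0xD6, subcode 0, a 3-byte payload.
frame = bytes([0xD6, 0x00, 7, 0]) + b"abc" + bytes([7, 0, 0x00, 0xD6])
print(read_variable_code(frame, 0))
```

A `None` result means the byte should be treated as unknown and skipped on its own, which is also what the program below does when the final byte does not match.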

Codes between 0x00..0x1F and 0x7F..0xBF are single-byte control codes (not all are used). Codes between 0xC0..0xCF are fixed-length control codes, with various predefined lengths. Codes from 0xD0 onward are always variable-length codes.
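These ranges translate directly into a small classifier. This is only an illustration of the ranges just listed; `classify_byte` and its return strings are invented names, and bytes between 0x20 and 0x7E are treated as ordinary text.

```python
def classify_byte(byte):
    """Map a byte value to the kind of code it starts."""
    if byte <= 0x1F or 0x7F <= byte <= 0xBF:
        return "single-byte"
    if 0xC0 <= byte <= 0xCF:
        return "fixed-length"
    if byte >= 0xD0:
        return "variable-length"
    return "text"  # 0x20..0x7E: plain characters


for value in (0x0A, ord("A"), 0xA9, 0xC0, 0xD6):
    print("0x%02X" % value, classify_byte(value))
```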

With only this information, it's already possible to extract the plain text of this document, and just skip all possible formatting. Comparing the output codes against the PDF from the same site reveals the meaning of some of the codes, such as the various types of Return, the Tab, and plain text formatting such as bold and italics.

In addition, footnotes are stored in-line inside a variable-length code, so this needs some form of re-entrant parser.
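To show what "re-entrant" means here, a stripped-down sketch: the function below handles only plain text and footnotes, and calls itself for the footnote body. The footnote code (0xD6 with subcategory 0) and the 0x13 offset to the footnote text are taken from the full program below; what the skipped bytes contain is not covered here. `extract_text` is a made-up name and the input is synthetic.

```python
def extract_text(data, start, end):
    """Collect printable text between start and end, descending into
    footnote codes (0xD6, subcategory 0) by calling itself."""
    out = []
    pos = start
    while pos < end:
        byte = data[pos]
        if byte == 10 or 32 <= byte <= 126:
            out.append(chr(byte))
            pos += 1
        elif byte == 0xD6 and pos + 4 <= end and data[pos + 1] == 0:
            length = data[pos + 2] + 256 * data[pos + 3]
            out.append("[Footnote:")
            # The footnote text sits inside the code, after 0x13 bytes.
            out.append(extract_text(data, pos + 0x13, pos + length))
            out.append("]")
            pos += 4 + length
        else:
            pos += 1  # anything else is skipped in this sketch
    return "".join(out)


# Synthetic input: text, one footnote holding "note", more text.
footnote = (bytes([0xD6, 0, 23, 0]) + bytes(15) + b"note"
            + bytes([23, 0, 0, 0xD6]))
sample = b"Body" + footnote + b" end"
print(extract_text(sample, 0, len(sample)))
```

Because the same function parses both the document body and the footnote body, formatting codes inside a footnote get the same treatment as everywhere else.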

The following Python 3 program is a basic framework which you can use as-is (it extracts the text, with a hint for the footnotes), or you can enable the commented-out lines at the bottom and find further information on parsing more of the formatting code.

# -*- coding: utf-8 -*-
import sys

WPType_None = 0
WPType_Text = 1
WPType_Byte = 2
WPType_Fixed = 3
WPType_Variable = 4

plain_remap = {10:'\n', 11:' ', 13:' ', 160:' ', 169:'-', 170:'-'}

WpCharacterSet = { 0x0121:'à', 0x0406:'§', 0x041c:u'’', 0x041d:u'‘', 0x041f:'”', 0x0420:'“', 0x0422:'—' }

textAttributes = [
    "Extra Large",
    "Very Large",
    "Large",
    "Small",
    "Fine",
    "Superscript",
    "Subscript",
    "Outline",
    "Italic",
    "Shadow",
    "Redline",
    "Double Underline",
    "Bold",
    "Strikeout",
    "Underline",
    "SmallCaps" ]

class WPElem:
    def __init__(self, type=WPType_None, data=None, code=None):
        self.type = type
        self.code = code
        # Avoid a shared mutable default: each element gets its own list.
        self.data = [] if data is None else data

class WordPerfect:
    def __init__(self, filename):
        with open(filename, "rb") as file:
            self.data = bytearray(file.read())
        sig = ''.join(chr(x) for x in self.data[1:4])
        if self.data[0] != 255 or sig != 'WPC':
            raise TypeError('Invalid file type')
        self.data_start = self.data[4]+256*(self.data[5]+256*(self.data[6]+256*self.data[7]))
        self.length = len(self.data)
        self.elements = []
        self.parse (self.data_start, self.length)

    def parse (self, start,maxlength):
        pos = start
        while pos < maxlength:
            byte = self.data[pos]
            if byte in plain_remap:
                byte = ord(plain_remap[byte])
            if byte == 10 or byte >= 32 and byte <= 126:
                if len(self.elements) == 0 or self.elements[-1].type != WPType_Text:
                    self.elements.append(WPElem(WPType_Text, ''))
                self.elements[-1].data += chr(byte)
                pos += 1
            elif byte == 12:
                self.elements.append(WPElem(WPType_Text, '\n\n'))
                pos += 1
            elif byte == 0x8c:  # [HRt/Pg Break]
                self.elements.append(WPElem(WPType_Text, '\n'))
                pos += 1
            elif byte == 0x8d:  # [Ftn Num]
                self.elements.append(WPElem(WPType_Text, '[Ftn Num]'))
                pos += 1
            elif byte == 0x99:  # [HRt/Top of Pg]
                self.elements.append(WPElem(WPType_Text, '\n'))
                pos += 1
            elif byte == 0xc0 and pos+3 < maxlength and self.data[pos+3] == 0xc0:
                wpchar = self.data[pos+1]+256*self.data[pos+2]
                if wpchar in WpCharacterSet:
                    self.elements.append(WPElem(WPType_Text, WpCharacterSet[wpchar]))
                else:
                    self.elements.append(WPElem(WPType_Text, '{CHAR:%04X}' % wpchar))
                pos += 4
            elif byte == 0xc1 and pos+8 < maxlength and self.data[pos+8] == 0xc1:
                # self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+7]))
                self.elements.append(WPElem(WPType_Text, '\t'))
                pos += 9
            elif byte == 0xc2 and pos+10 < maxlength and self.data[pos+10] == 0xc2:
                # self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+9]))
                self.elements.append(WPElem(WPType_Text, '\t'))
                pos += 11
            elif byte == 0xc3:
                self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+1], '%s On' % textAttributes[self.data[pos+1]]))
                pos += 3
            elif byte == 0xc4:
                self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+1], '%s Off' % textAttributes[self.data[pos+1]]))
                pos += 3
            elif byte == 0xc6:
                self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+5]))
                pos += 6
            elif byte == 0xd6 and self.data[pos+1] == 0:    # Footnote
                self.elements.append(WPElem(WPType_Text, '[Footnote:'))
                length = self.data[pos+2]+256*self.data[pos+3]
                self.parse (pos+0x13, pos+length)
                pos += 4+length
                self.elements.append(WPElem(WPType_Text, ']'))

            else:
                self.elements.append(WPElem(WPType_Byte, [byte]))
                if byte >= 0xd0 and pos+4 <= maxlength:
                    length = self.data[pos+2]+256*self.data[pos+3]
                    if pos+4+length <= self.length and self.data[pos+4+length-1] == byte:
                        self.elements[-1].type = WPType_Variable
                        self.elements[-1].data += [x for x in self.data[pos+1:pos+length]]
                        pos += 4+length
                    else:
                        pos += 1
                else:
                    pos += 1


if len(sys.argv) != 2:
    print("usage: read_wpf.py [suitably ancient WordPerfect file]")
    sys.exit(1)

wpdata = WordPerfect (sys.argv[1])

for i in wpdata.elements:
    if i.type == WPType_Text:
        print (i.data, end='')
'''
    elif i.code:
        print ('[%s]' % i.code, end='')
    elif i.type == WPType_Variable:
        print ('[%02X:%d]' % (i.data[0],i.data[1]), end='')
    else:
        print ('[%02X]' % i.data[0], end='')
'''

Running it prints out the text to the console:

$ python3 read_wpf.py 2R.WPF
        RESTRICTED
World Trade WT/DS2/R
    29 January 1996
Organization    
    (96-0326)
(.. several thousands of lines omitted for brevity ..)
8.2 The Panel recommends that the Dispute Settlement Body request the
United States to bring this part of the Gasoline Rule into conformity
with its obligations under the General Agreement.

You can either rewrite the program to store this in a plain text file, or redirect the console output into a file.
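For the first option, a minimal sketch. `write_text` is a hypothetical helper, not part of the program above; it takes any sequence of strings and writes them as UTF-8, so the curly quotes and dashes from the character table survive regardless of the console encoding.

```python
def write_text(chunks, path):
    """Write an iterable of text chunks to a UTF-8 file and return the
    number of characters written."""
    total = 0
    with open(path, "w", encoding="utf-8") as handle:
        for chunk in chunks:
            total += handle.write(chunk)
    return total


# With the program above, something like this should work:
#     write_text((e.data for e in wpdata.elements
#                 if e.type == WPType_Text), "2R.txt")
```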

I've only added a translation for the handful of special characters that appear in the sample file. For a full-featured version, you'd need to look up a 90s data sheet somewhere, and provide Unicode translations for each of the thousands of characters.
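If you do extend the table, it may help to isolate the lookup. The sketch below repeats what the program above does for the 4-byte 0xC0 code: the two middle bytes are combined little-endian into one key, and unknown keys become a visible placeholder. `decode_special` is an invented name, and the table holds only the entries already listed in the program.

```python
# Only the mappings that appear in the program above.
WP_CHARS = {0x0121: "à", 0x0406: "§", 0x041C: "’", 0x041D: "‘",
            0x041F: "”", 0x0420: "“", 0x0422: "—"}


def decode_special(frame):
    """Decode one 4-byte special character code: C0 xx yy C0."""
    if len(frame) != 4 or frame[0] != 0xC0 or frame[3] != 0xC0:
        raise ValueError("not a special character code")
    key = frame[1] + 256 * frame[2]
    # Unknown characters stay visible so they can be added to the table.
    return WP_CHARS.get(key, "{CHAR:%04X}" % key)


print(decode_special(bytes([0xC0, 0x06, 0x04, 0xC0])))
print(decode_special(bytes([0xC0, 0x01, 0x01, 0xC0])))
```

The placeholders in the output then tell you exactly which keys still need a Unicode translation.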

Similarly, I've only 'parsed' some of the special formatting codes, and to a very limited extent. If you need to extract particular formatting (say, tab settings, margins, font sizes, et cetera), you must locate a full specification of the file format and add specific parsing code for these functions.

Jongware