As can be determined from the very first bytes, that file is a WordPerfect 5.x file (where x is 0, 1, or possibly 2), a file format dating back to around 1989.
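As a rough illustration of that check: the sketch below tests for the 4-byte signature (0xFF followed by "WPC") and reads the little-endian offset of the document area from bytes 4..7. The function name `read_wp_header` and the sample bytes are made up for this example, and the meaning of bytes 8..11 (product type, file type, major and minor version) is an assumption based on commonly circulated descriptions of the header, not something verified against the sample file.

```python
import struct

WP_MAGIC = b"\xffWPC"  # signature: byte 0xFF followed by the letters "WPC"

def read_wp_header(data):
    """Return a few header fields, or raise ValueError for a non-WordPerfect file."""
    if len(data) < 16 or data[:4] != WP_MAGIC:
        raise ValueError("not a WordPerfect file")
    # Bytes 4..7: little-endian offset at which the document text starts.
    (data_start,) = struct.unpack_from("<I", data, 4)
    return {
        "data_start": data_start,
        # Assumption: layout of bytes 8..11, not confirmed by the sample file.
        "product_type": data[8],
        "file_type": data[9],
        "major_version": data[10],
        "minor_version": data[11],
    }

# Synthetic 16-byte header, for illustration only (not taken from a real file).
sample = WP_MAGIC + struct.pack("<I", 0x200) + bytes([1, 10, 0, 1]) + bytes(4)
header = read_wp_header(sample)
print(header["data_start"], header["minor_version"])
```

If the version bytes do behave as assumed, they would be the place to tell a 5.0 file from a 5.1 file.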
According to its description, the Tika interface for Python should be able to convert this for you, but as far as word processor formats go, these older WordPerfect files are fairly easy to decode, without anything more than a plain Python installation.
The format consists of a large header followed by plain text interspersed with binary codes that govern the formatting. Among other information, the header defines the printer that the document was formatted for, the list of fonts used, and some basic "style" information; I chose to skip it entirely in my program below.
The binary codes appear in 3 distinct forms:
- single-byte: 0x0A is a Return, 0xA9 is a breaking hyphen, 0xAA is a breaking hyphen when the line is broken at that position, and so on.
- multi-byte, fixed length: the byte is followed by one or more specifications. For example, 0xC0 is a "special character code". It is followed by two bytes that select the character: the index of the character inside its character set, then the number of that set (this is the order in which the program below reads them). The final byte of a fixed-length code is always the starting byte again.
- multi-byte, variable length: the first byte determines a main category of formatting and is followed by a second byte that indicates a subcategory; after that, 2 bytes in little-endian order give the length of the following data (excluding the first 4 bytes). This code always ends with the same items in reversed order: 2 bytes (little-endian) for the length, the subcategory, then the main category.
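The variable-length layout can be sketched as a pair of small helpers. This is only an illustration of the framing described above; `build_variable` and `read_variable` are hypothetical names, and the frame used in the demonstration is synthetic rather than taken from a real document.

```python
def build_variable(main, sub, payload):
    """Assemble a variable-length code around payload (used here for testing only)."""
    length = len(payload) + 4  # counts the payload plus the 4 trailing bytes
    size = length.to_bytes(2, "little")
    return bytes([main, sub]) + size + payload + size + bytes([sub, main])

def read_variable(data, pos):
    """Return (main, sub, payload, next_pos) for the code that starts at pos."""
    if pos + 4 > len(data):
        raise ValueError("truncated code header")
    main, sub = data[pos], data[pos + 1]
    length = int.from_bytes(data[pos + 2:pos + 4], "little")
    end = pos + 4 + length
    if length < 4 or end > len(data):
        raise ValueError("implausible length")
    # The last 4 bytes mirror the header: length (2 bytes), subcategory, main category.
    trailer = bytes(data[pos + length:end])
    if trailer != bytes(data[pos + 2:pos + 4]) + bytes([sub, main]):
        raise ValueError("trailer does not mirror the header")
    return main, sub, bytes(data[pos + 4:pos + length]), end

frame = build_variable(0xD6, 0x00, b"abc")
print(read_variable(b"x" + frame, 1))
```

Checking the mirrored trailer is a cheap way to notice that a byte of 0xD0 or above was not really the start of a code.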
Codes in the ranges 0x00..0x1F and 0x7F..0xBF are single-byte control codes (not all are used). Codes in the range 0xC0..0xCF are fixed-length control codes, with various predefined lengths. Codes from 0xD0 onward are always variable-length codes.
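These ranges translate directly into a small classifier. The sketch below is only an illustration; the function name `classify` and its return labels are made up here, and the remaining range 0x20..0x7E is treated as plain text.

```python
def classify(byte):
    """Map a byte value (0..255) to the kind of item it starts."""
    if byte >= 0xD0:
        return "variable"   # variable-length code
    if byte >= 0xC0:
        return "fixed"      # fixed-length code
    if byte <= 0x1F or byte >= 0x7F:
        return "single"     # single-byte control code
    return "text"           # 0x20..0x7E: plain text

for value in (0x41, 0x0A, 0xA9, 0xC0, 0xD6):
    print("%02X" % value, classify(value))
```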
With only this information, it's already possible to extract the plain text of this document, and just skip all possible formatting. Comparing the output codes against the PDF from the same site reveals the meaning of some of the codes, such as the various types of Return, the Tab, and plain text formatting such as bold and italics.
In addition, footnotes are stored in-line inside a variable-length code, so this needs some form of re-entrant parser.
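To illustrate what "re-entrant" means here, the following minimal sketch calls itself on the bytes inside a footnote code. It is deliberately simplified: it assumes footnotes use code 0xD6 with subcategory 0 and that the note text starts 0x13 bytes into the code (both values come from inspecting the sample file), it does not verify the trailing bytes, and it skips every other control byte one at a time, so the bodies of fixed-length codes are not handled properly. The name `extract_text` and the test data are invented for the example.

```python
FOOTNOTE = 0xD6          # main category of a footnote code (from the sample file)
FOOTNOTE_HEADER = 0x13   # assumed offset of the note text inside the code

def extract_text(data, start, end):
    """Collect plain text from data[start:end], descending into footnotes."""
    out = []
    pos = start
    while pos < end:
        byte = data[pos]
        if byte == 0x0A or 0x20 <= byte <= 0x7E:
            out.append(chr(byte))
            pos += 1
        elif byte >= 0xD0 and pos + 4 <= end:
            length = int.from_bytes(data[pos + 2:pos + 4], "little")
            if byte == FOOTNOTE and data[pos + 1] == 0:
                # Re-enter the parser for the text stored inside the code.
                inner = extract_text(data, pos + FOOTNOTE_HEADER, min(pos + length, end))
                out.append("[Footnote:" + inner + "]")
            pos += 4 + length   # skip the whole code, including its trailer
        else:
            pos += 1            # simplification: ignore all other control bytes
    return "".join(out)

# Synthetic footnote: 4 header bytes, 15 filler bytes, the note text, 4 trailer bytes.
payload = bytes(FOOTNOTE_HEADER - 4) + b"note"
size = (len(payload) + 4).to_bytes(2, "little")
footnote = bytes([FOOTNOTE, 0]) + size + payload + size + bytes([0, FOOTNOTE])
document = b"Body" + footnote + b" end"
print(extract_text(document, 0, len(document)))
```

Returning the text from each call, rather than appending to shared state, keeps the nested call independent of its caller.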
The following Python 3 program is a basic framework which you can use as-is (it extracts the text, with a hint for the footnotes), or you can enable the commented-out lines at the bottom and find further information on parsing more of the formatting code.
    # -*- coding: utf-8 -*-
    import sys

    # Element types stored in WordPerfect.elements
    WPType_None = 0
    WPType_Text = 1
    WPType_Byte = 2
    WPType_Fixed = 3
    WPType_Variable = 4

    # Single-byte codes that map directly onto a plain-text character
    plain_remap = {10:'\n', 11:' ', 13:' ', 160:' ', 169:'-', 170:'-'}

    # Special characters, keyed by 256*character set + index (only those in the sample file)
    WpCharacterSet = { 0x0121:'à', 0x0406:'§', 0x041c:'’', 0x041d:'‘', 0x041f:'”', 0x0420:'“', 0x0422:'—' }

    textAttributes = [
        "Extra Large",
        "Very Large",
        "Large",
        "Small",
        "Fine",
        "Superscript",
        "Subscript",
        "Outline",
        "Italic",
        "Shadow",
        "Redline",
        "Double Underline",
        "Bold",
        "Strikeout",
        "Underline",
        "SmallCaps" ]

    def attribute_name(index):
        # Guard against attribute numbers outside the known table
        if 0 <= index < len(textAttributes):
            return textAttributes[index]
        return 'Attribute %d' % index

    class WPElem:
        def __init__(self, type=WPType_None, data=None, code=None):
            self.type = type
            self.code = code
            # Avoid sharing one mutable default list between instances
            self.data = [] if data is None else data

    class WordPerfect:
        def __init__(self, filename):
            with open(filename, "rb") as file:
                self.data = bytearray(file.read())
            # Signature: 0xFF 'W' 'P' 'C'
            if self.data[:4] != b'\xffWPC':
                raise TypeError('Invalid file type')
            # Bytes 4..7: little-endian offset of the document text
            self.data_start = int.from_bytes(self.data[4:8], 'little')
            self.length = len(self.data)
            self.elements = []
            self.parse (self.data_start, self.length)

        def byte_at (self, pos):
            # Return -1 rather than raising IndexError when looking past the end
            return self.data[pos] if pos < self.length else -1

        def parse (self, start, maxlength):
            pos = start
            while pos < maxlength:
                byte = self.data[pos]
                if byte in plain_remap:
                    byte = ord(plain_remap[byte])
                if byte == 10 or 32 <= byte <= 126:
                    # Plain text: extend the current text element, or start a new one
                    if len(self.elements) == 0 or self.elements[-1].type != WPType_Text:
                        self.elements.append(WPElem(WPType_Text, ''))
                    self.elements[-1].data += chr(byte)
                    pos += 1
                elif byte == 12:
                    self.elements.append(WPElem(WPType_Text, '\n\n'))
                    pos += 1
                elif byte == 0x8c: # [HRt/Pg Break]
                    self.elements.append(WPElem(WPType_Text, '\n'))
                    pos += 1
                elif byte == 0x8d: # [Ftn Num]
                    self.elements.append(WPElem(WPType_Text, '[Ftn Num]'))
                    pos += 1
                elif byte == 0x99: # [HRt/Top of Pg]
                    self.elements.append(WPElem(WPType_Text, '\n'))
                    pos += 1
                elif byte == 0xc0 and self.byte_at(pos+3) == 0xc0: # special character
                    wpchar = self.data[pos+1]+256*self.data[pos+2]
                    if wpchar in WpCharacterSet:
                        self.elements.append(WPElem(WPType_Text, WpCharacterSet[wpchar]))
                    else:
                        self.elements.append(WPElem(WPType_Text, '{CHAR:%04X}' % wpchar))
                    pos += 4
                elif byte == 0xc1 and self.byte_at(pos+8) == 0xc1:
                    # self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+7]))
                    self.elements.append(WPElem(WPType_Text, '\t'))
                    pos += 9
                elif byte == 0xc2 and self.byte_at(pos+10) == 0xc2:
                    # self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+9]))
                    self.elements.append(WPElem(WPType_Text, '\t'))
                    pos += 11
                elif byte == 0xc3: # attribute on
                    self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+1], '%s On' % attribute_name(self.byte_at(pos+1))))
                    pos += 3
                elif byte == 0xc4: # attribute off
                    self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+1], '%s Off' % attribute_name(self.byte_at(pos+1))))
                    pos += 3
                elif byte == 0xc6:
                    self.elements.append(WPElem(WPType_Fixed, self.data[pos:pos+5]))
                    pos += 6
                elif byte == 0xd6 and pos+4 <= maxlength and self.data[pos+1] == 0: # Footnote
                    self.elements.append(WPElem(WPType_Text, '[Footnote:'))
                    length = self.data[pos+2]+256*self.data[pos+3]
                    # The footnote text is stored inside the code: parse it recursively
                    self.parse (pos+0x13, min(pos+length, self.length))
                    pos += 4+length
                    self.elements.append(WPElem(WPType_Text, ']'))
                else:
                    # Unknown code: store it, and skip it as a whole if it is variable-length
                    self.elements.append(WPElem(WPType_Byte, [byte]))
                    size = 1
                    if byte >= 0xd0 and pos+4 <= maxlength:
                        length = self.data[pos+2]+256*self.data[pos+3]
                        # Accept the length only if the closing byte repeats the code
                        if pos+4+length <= self.length and self.data[pos+4+length-1] == byte:
                            self.elements[-1].type = WPType_Variable
                            self.elements[-1].data += list(self.data[pos+1:pos+length])
                            size = 4+length
                    pos += size

    if len(sys.argv) != 2:
        print("usage: read_wpf.py [suitably ancient WordPerfect file]")
        sys.exit(1)

    wpdata = WordPerfect (sys.argv[1])

    for i in wpdata.elements:
        if i.type == WPType_Text:
            print (i.data, end='')
        '''
        elif i.code:
            print ('[%s]' % i.code, end='')
        elif i.type == WPType_Variable:
            print ('[%02X:%d]' % (i.data[0],i.data[1]), end='')
        else:
            print ('[%02X]' % i.data[0], end='')
        '''
Running it prints out the text to the console:
$ python3 read_wpf.py 2R.WPF
RESTRICTED
World Trade WT/DS2/R
29 January 1996
Organization
(96-0326)
(.. several thousands of lines omitted for brevity ..)
8.2 The Panel recommends that the Dispute Settlement Body request the
United States to bring this part of the Gasoline Rule into conformity
with its obligations under the General Agreement.
and you can either rewrite the program to store this into a plain text file, or redirect via your console into a file.
I've only added a translation for the handful of special characters that appear in the sample file. For a full-featured version, you'd need to look up a 90s data sheet somewhere, and provide Unicode translations for each of the thousands of characters.
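One way to organise such a translation is a table keyed by (character set, index), with a visible placeholder for anything not yet mapped. The sketch below is an illustration only: `WP_CHARS` holds a few entries observed in the sample file, the helper name `decode_special` is made up, and the byte order (index first, then character set) is how the sample file appears to store these codes.

```python
# (character set, index) -> Unicode; a tiny subset, for illustration only
WP_CHARS = {
    (4, 6): "\u00a7",    # section sign
    (4, 28): "\u2019",   # right single quotation mark
    (4, 29): "\u2018",   # left single quotation mark
    (4, 31): "\u201d",   # right double quotation mark
    (4, 32): "\u201c",   # left double quotation mark
}

def decode_special(data, pos):
    """Decode the 4-byte code C0 <index> <set> C0 that starts at pos."""
    if pos + 3 >= len(data) or data[pos] != 0xC0 or data[pos + 3] != 0xC0:
        raise ValueError("not a special character code")
    index, charset = data[pos + 1], data[pos + 2]
    # Unknown characters stay visible so the gaps in the table are easy to find.
    return WP_CHARS.get((charset, index), "{CHAR:%d,%d}" % (charset, index))

print(decode_special(bytes([0xC0, 0x06, 0x04, 0xC0]), 0))
print(decode_special(bytes([0xC0, 0x01, 0x08, 0xC0]), 0))
```

Running the extractor over a document and collecting the placeholders gives a list of exactly the characters that still need a translation.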
Similarly, I've only 'parsed' some of the special formatting codes, and to a very limited extent. If you need to extract particular formatting (say, tab settings, margins, font sizes, et cetera), you must locate a full specification of the file format and add specific parsing code for these functions.