I have to write a program that decodes a CP437-encoded file by mapping each byte to its Unicode code point according to the CP437 table, converting that code point to UTF-8, and writing the output to a file.

I have two files: an input file that contains a long text with both normal characters and some weird characters (in the result file those weird characters will be replaced by various dashes), and a CP437 file that contains 256 lines of pairs (the first part is the byte value as a decimal number, the second is the Unicode code point in hexadecimal, for example, 73 0049).

This is how I'm trying to solve this problem:

  1. Open the input file using the 'rb' flag
  2. Since I'm opening the file using 'rb', I read every symbol as a byte and store it in the 'text' list as a string of bits
  3. After I'm done reading the file, I loop through the text list
  4. During the loop, I get the decimal value of the symbol
  5. I look up the Unicode code point in the CP437.txt file using the decimal value
  6. I convert the Unicode code point to a string of 0s and 1s
  7. I convert that binary representation to UTF-8 and get a string of 0s and 1s back
  8. I convert those UTF-8 0s and 1s to bytes and write them to the results file, which is opened with the 'wb' flag

Also, if the UTF-8 string of 0s and 1s is longer than 8 characters, I split it into groups of 8 and convert each group into a byte (I'm not sure if this is correct).

The main problem is that when I write the results, I get a lot of gibberish characters, and I'm not sure where the problem is. Any help is appreciated; I've been stuck on this assignment for a while now and just can't figure out what's wrong.

def convertBinToHex(binary):
    binToHex = hex(int(binary, 2))
    temp = list(binToHex)
    temp = temp[2:]
    binToHex = "".join(temp).upper()
    return binToHex


def convertUnicodeToUTF(unicodeBin, symbolDecimal, returnBin):
    # https://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16
    bytesCount = 0
    if int("0000", 16) <= symbolDecimal <= int("007F", 16):
        if returnBin:
            return unicodeBin
        return convertBinToHex(unicodeBin)
    elif int("0080", 16) <= symbolDecimal <= int("07FF", 16):
        bytesCount = 2
    elif int("0800", 16) <= symbolDecimal <= int("FFFF", 16):
        bytesCount = 3
    elif int("10000", 16) <= symbolDecimal <= int("10FFFF", 16):
        bytesCount = 4
    else:
        return

    # Bit templates for each encoded length: 'x' marks a payload bit
    if bytesCount == 2:
        template = list('110xxxxx' '10xxxxxx')
    elif bytesCount == 3:
        template = list('1110xxxx' '10xxxxxx' '10xxxxxx')
    elif bytesCount == 4:
        template = list('11110xxx' '10xxxxxx' '10xxxxxx' '10xxxxxx')
    else:
        return

    results = []
    unicodeList = list(unicodeBin)
    counter = len(unicodeList) - 1

    for el in reversed(template):
        if el == 'x':
            if counter >= 0:
                results.append(unicodeList[counter])
                counter -= 1
            else:
                results.append('0')
        elif el == '0':
            results.append('0')
        else:
            results.append('1')

    results.reverse()
    results = "".join(results)

    if returnBin:
        return results
    else:
        return convertBinToHex(results)



codePage = {}
with open("CP437.txt") as f:
    for line in f:
        (key, val) = line.split()
        codePage[key] = val

text = []

with open("386intel.txt", 'rb') as f:
    while True:
        c = f.read(1)
        if c:
            # Converts bytes to bits (string)
            text.append("{:08b}".format(int(c.hex(), 16)))
        if not c:
            print("End of file")
            break


bytesString = 0
bytesStringInt = 0
resultFile = open("rez.txt", "wb")

for item in text:
    decimalValue = int(item, 2)
    newUnicode = codePage[str(decimalValue)]
    unicodeToBin = "{0:08b}".format(int(newUnicode, 16))
    bytesString = convertUnicodeToUTF(unicodeToBin, decimalValue, True)
    if len(bytesString) > 8:
        bytesStringSplit = [bytesString[i:i + 8] for i in range(0, len(bytesString), 8)]
        for x in bytesStringSplit:
            bytesStringInt = int(x, 2)
            resultFile.write(bytes([bytesStringInt]))
            # print(bytes([bytesStringInt]))
    else:
        bytesStringInt = int(bytesString, 2)
        resultFile.write(bytes([bytesStringInt]))
        # print(bytes([bytesStringInt]))
SR1000

1 Answer

Untested because you neglected to provide the input files:

#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

my @cp;
{
    open my $fh, '<', 'CP437.txt';
    while (my $line = readline $fh) {
        chomp $line;
        my ($k, $v) = split ' ', $line;
        $cp[$k] = chr hex $v;
    }
}
{
    open my $in, '<:raw', '386intel.txt';
    open my $out, '>:encoding(UTF-8)', '386intel.txt.utf8';
    while (my $line = readline $in) {
        $out->print(
            join '',            # 5. join characters into string
            map {               # 2. loop over octets
                $cp[            # 4. look up character corresponding to
                                    # octet numeric value
                    ord         # 3. numeric value of octet
                ]
            }
            split '', $line     # 1. split line into octets
        );
    }
}

The program is quite easy to understand with only 10 lines of significant code (and also easy to port to Python, if needed).
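
For reference, a rough Python port of the same idea might look like the sketch below. It is just as untested against the real input files. The function names `load_table` and `decode_file` are made up for illustration, and the sketch assumes CP437.txt holds exactly the "decimal hex" pairs described in the question, one per line, covering all 256 byte values.

```python
def load_table(path):
    """Read 'decimal hex' pairs (e.g. '73 0049') into a 256-entry list of characters."""
    table = [None] * 256
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            key, val = line.split()
            table[int(key)] = chr(int(val, 16))
    return table


def decode_file(src, dst, table):
    """Map every byte of src through the table and write the result as UTF-8."""
    # newline="" keeps line endings exactly as they come out of the table
    with open(src, "rb") as fin, open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            # iterating over a bytes object yields integers 0..255
            fout.write("".join(table[b] for b in line))
```

It would be called as `decode_file("386intel.txt", "rez.txt", load_table("CP437.txt"))`. Here the UTF-8 encoding is left to the file object, just as the Perl version leaves it to the `:encoding(UTF-8)` layer.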


If the file CP437.txt follows the standard, then it simply becomes:

› piconv -f CP437 -t UTF-8 < 386intel.txt > 386intel.txt.utf8
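
The Python counterpart of that one-liner would be the built-in `cp437` codec. The sketch below uses a hypothetical helper name, `cp437_to_utf8`. Note that Python's codec leaves bytes 0x00 to 0x1F as plain control characters, so it only matches if CP437.txt does the same.

```python
def cp437_to_utf8(data: bytes) -> bytes:
    """Decode CP437 bytes with Python's built-in codec and re-encode as UTF-8."""
    return data.decode("cp437").encode("utf-8")


if __name__ == "__main__":
    # 0xC4 is the box-drawing horizontal line (U+2500) in CP437
    print(cp437_to_utf8(b"\xc4"))  # b'\xe2\x94\x80'
```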

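In case the assignment really involves manual encoding to UTF-8 instead of using a library, then substitute your own encoding routine at the place in the code where the chr function is called, and open the output file in raw mode, since the data is then already UTF-8 bytes.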
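
A manual encoder could be sketched in Python roughly as follows (`encode_utf8` is a hypothetical name). The important detail is that the number of output bytes is chosen from the Unicode code point itself. The code in the question appears to pass the original byte value (`decimalValue`) to the range check in `convertUnicodeToUTF` rather than the code point looked up in the table, which would explain the garbled output for any code point above U+07FF.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using bit shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value: %r" % cp)
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

With this, each input byte `b` would be written to a file opened in 'wb' mode as `encode_utf8(int(codePage[str(b)], 16))`, using the `codePage` dictionary from the question.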

daxim