
I need to decide when (not) to convert a text file based on the known file encoding and the desired output encoding.

If the text is US-ASCII, no conversion is needed when the output encoding is ASCII, UTF-8, Latin-1, ...
A US-ASCII file obviously does have to be converted when the output encoding is UTF-16 or UTF-32.

A list of standard encodings exists at
http://www.iana.org/assignments/character-sets/character-sets.xml

A conversion is necessary if:

  • the minimum character (code unit) size is > 1 byte or
  • the first 128 code points (0x00 to 0x7F) are not the same as US-ASCII.
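
For a file already known to be pure US-ASCII, both conditions above can be checked at once by encoding each of the 128 ASCII characters in the target encoding and comparing the result with the original single byte. The following is only a minimal sketch: `ascii_needs_conversion` is a hypothetical helper name, and the check relies on whatever codecs the local Python installation provides rather than on the IANA list.

```python
def ascii_needs_conversion(target_encoding):
    """Return True if US-ASCII text must be re-encoded for target_encoding.

    Hypothetical helper: assumes the source file contains only bytes 0x00-0x7F.
    """
    for code in range(128):
        try:
            encoded = chr(code).encode(target_encoding)
        except UnicodeEncodeError:
            # The target cannot represent this ASCII character at all.
            return True
        if encoded != bytes([code]):
            # Wider code unit, a byte order mark, or a different byte value.
            return True
    return False
```

An encoding with a minimum unit larger than one byte fails the comparison on the first character, so the size condition does not need a separate test.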

I'd like to know:

  • Is there a similar list with details (byte length, ASCII compatibility) about the implementation of each encoding?

EDIT
I already found an answer to the question

  • Are all 8-or-variable8-bit-based codecs a superset of ASCII?
    • In other words: Can US-ASCII be interpreted as any 8-or-variable8-bit-based encoding?

here: Character set that is not a superset of ASCII
Instead, it would be helpful to know:

  • Is there a list of character sets which are supersets of ASCII?

This looks promising:
mime.charsets - list of character sets which are ASCII supersets,
but I couldn't find an actual mime.charsets file.

Community
Martin Hennings
  • You want this purely to decide whether or not something *needs conversion*? Why not simply do the conversion; if nothing needed to be changed, nothing will happen. Not quite understanding what scenario something like this would be useful in. – deceze Oct 30 '13 at 11:32
  • @deceze I convert a bunch of files replacing the old files. I don't want to touch the files that don't need to be converted. Sounds reasonable? – Martin Hennings Oct 30 '13 at 11:41
  • What about converting them, testing if they're identical to the original and discarding the conversion if so? Sounds a lot simpler to me. – deceze Oct 30 '13 at 11:43
  • @deceze I think we should return to the original question "Is there a list of character sets which are supersets of ASCII?" – Martin Hennings Oct 30 '13 at 12:04

1 Answer


An alternative approach is to decode the bytes 0x00 - 0x7F in the given encoding, and check that the characters match ASCII. For example, in Python 3.x:

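def is_ascii_superset(encoding):
    # Decode each 7-bit byte on its own; 'ignore' turns an undecodable byte
    # into '', which then fails the comparison below.
    for codepoint in range(128):
        if bytes([codepoint]).decode(encoding, 'ignore') != chr(codepoint):
            return False
    return True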

This gives:

>>> is_ascii_superset('US-ASCII')
True
>>> is_ascii_superset('windows-1252')
True
>>> is_ascii_superset('ISO-8859-15')
True
>>> is_ascii_superset('UTF-8')
True
>>> is_ascii_superset('UTF-16')
False
>>> is_ascii_superset('IBM500') # a variant of EBCDIC
False
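
The same check can be run over every codec Python knows about to produce the kind of list the question asks for. This is a rough sketch, not a definitive list: it assumes the canonical names found in `encodings.aliases` are a good enough inventory, `ascii_superset_codecs` is a hypothetical helper name, and any codec that is missing on the platform or is not a text encoding is simply treated as incompatible. Because only single bytes are tested, stateful or escape-based codecs may be classified differently from how they behave on longer input.

```python
from encodings.aliases import aliases


def is_ascii_superset(encoding):
    """True if each byte 0x00-0x7F decodes to the same ASCII character."""
    try:
        return all(
            bytes([b]).decode(encoding, "ignore") == chr(b) for b in range(128)
        )
    except Exception:
        # Unknown codec, non-text codec, or a codec that rejects the
        # 'ignore' error handler: count it as not ASCII-compatible.
        return False


def ascii_superset_codecs():
    """Sorted canonical names of locally available ASCII-superset codecs."""
    return sorted(name for name in set(aliases.values()) if is_ascii_superset(name))


if __name__ == "__main__":
    for name in ascii_superset_codecs():
        print(name)
```

The resulting list depends on the Python version and platform, so it is best regenerated on the machine that performs the conversion.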

EDIT: Get US-ASCII compatibility for each encoding supported by your Qt version in C++:

#include <QTextCodec>
#include <QMap>

typedef enum
{
    eQtCodecUndefined,
    eQtCodecAsciiIncompatible,
    eQtCodecAsciiCompatible,
} tQtCodecType;

QMap<QByteArray, tQtCodecType> QtCodecTypes()
{
    QMap<QByteArray, tQtCodecType> CodecTypes;
    // How to test Qt's interpretation of ASCII data?
    QList<QByteArray> available = QTextCodec::availableCodecs();
    QTextCodec *referenceCodec = QTextCodec::codecForName("UTF-8"); // because Qt has no US-ASCII, but we only test bytes 0-127 and UTF-8 is a superset of US-ASCII
    if(referenceCodec == 0)
    {
        qDebug("Unable to get reference codec 'UTF-8'");
        return CodecTypes;
    }
    for(int i = 0; i < available.count(); i++)
    {
        const QByteArray name = available.at(i);
        QTextCodec *currCodec = QTextCodec::codecForName(name);
        if(currCodec == NULL)
        {
            qDebug("Unable to get codec for '%s'", qPrintable(QString(name)));
            CodecTypes.insert(name, eQtCodecUndefined);
            continue;
        }
        tQtCodecType type = eQtCodecAsciiCompatible;
        for(uchar j = 0; j < 128; j++) // UTF-8 == US-ASCII in the lower 7 bit
        {
            const char c = (char)j; // character to test, always < 2^7 (7-bit ASCII range)
            QString sRef, sTest;
            sRef = referenceCodec->toUnicode(&c, 1); // convert character to UTF-16 (QString internal) assuming it is ASCII (via UTF-8)
            sTest = currCodec->toUnicode(&c, 1); // convert character to UTF-16 assuming it is of type [currCodec]
            if(sRef != sTest) // compare both UTF-16 representations -> if they are equal, these codecs are transparent for Qt
            {
                type = eQtCodecAsciiIncompatible;
                break;
            }
        }
        CodecTypes.insert(name, type);
    }

    return CodecTypes;
}
phuclv
dan04
  • You're right, thinking about it, the criteria for being an ASCII superset are quite simple, so I can create that list myself - I'll add my C++ implementation to your answer for further reference as soon as it works. – Martin Hennings Oct 31 '13 at 07:16
  • Oh, very interesting solution. Just checking the first 128 bytes of the encoding I've been given is just as easy as checking it against a list. – Nyerguds Aug 19 '16 at 12:32