Decode utf-8 in tarfile

Question

I have a tar file whose entry names contain multibyte (Japanese) characters. I am using libarchive to untar the file. The filenames inside the tar file are encoded in UTF-8. When I untar the file, the resulting filenames always lose the multibyte characters.

I wrote a Python script that produces the result I want, and it works:

#!/usr/bin/python27

import tarfile
import pdb
def transform(data):
    u = data.decode('utf8')
    pdb.set_trace()
    #return u.encode('utf8')
    return u

tar = tarfile.open('abc.tar')
for m in tar.getmembers():
    print m.name
    m.name = transform(m.name)
    #print m.name

tar.extractall()

However, I want to achieve the same in C++. This is an extract of the C++ code:

while (entry = tar_file->nextEntry()) {
    fs::path filepath = path / entry->getFileName();  // the UTF-8 characters are lost here
    // So I tried the following
    int wchars_num =  MultiByteToWideChar( CP_ACP , 0 , filepath.string().c_str() , -1, NULL , 0 );
    wchar_t* wstr = new wchar_t[wchars_num];

    //I tried UTF-8 as well in place of CP_ACP
    MultiByteToWideChar( CP_ACP , 0 , filepath.string().c_str() , -1, wstr , wchars_num );
    // But this did not help

There is no such function called "MultiByteToWideChar" in the C++ standard. You must be using your operating system-specific library functions. In any case, UTF-8 is a very simple encoding. There are several, easy Wikipedia articles that describe it, and it shouldn't take more than an hour, or two, to write something quick to decode UTF-8-content into UTF-16 or UTF-32 octets. — Sam Varshavchik, Feb 26 '16 at 13:32
@SamVarshavchik: Are you seriously suggesting implementing your own UTF-8 decoder from scratch? `MultiByteToWideChar()` implies Windows. It can be used to decode UTF-8 (see [`utf8_decode()`](http://stackoverflow.com/a/3999597/4279)) — jfs, Feb 26 '16 at 14:19
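To make the two suggestions in the comments concrete: on Windows, `wchar_t` holds UTF-16 code units, and `CP_ACP` tells `MultiByteToWideChar()` to read the bytes in the system ANSI code page, which would mangle UTF-8 input; `CP_UTF8` is the code page constant for UTF-8. The block below is a minimal, portable sketch of the same conversion written by hand, roughly what a hand-rolled decoder might look like. The function name `utf8_to_utf16` and the choice to throw on malformed input are assumptions made for illustration; they are not part of libarchive, Boost, or the Windows API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a UTF-8 byte string to UTF-16 code units.
// Throwing on malformed input is an assumed policy, not a requirement.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point allowed for each sequence length (rejects overlong forms).
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;

        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The idea would be to call this on the raw name returned by `entry->getFileName()` before building a path, so that the bytes are never interpreted in a narrow code page. Whether the result can be handed directly to a wide-character path constructor depends on the platform and on the filesystem library in use.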

