
I have a C program, test.elf, that should process a UTF-8 encoded file and print it in a terminal on a UTF-8 system. Someone gave me a file, components.csv, which is ISO-8859-1 encoded, and I ran into problems.

My system's encoding can be checked in a terminal and it is indeed UTF-8:

[ziga@localhost ~]$ echo $LANG
en_US.UTF-8

I can also check, or rather guess (!), the file's encoding, which is one of ISO-8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15} (source):

[ziga@localhost ~]$ file components.csv 
components.csv: ISO-8859 text, with very long lines, with CRLF line terminators

If I read this file directly using cat and limit the output to the first few lines using head, I see the first unknown character, �. This is expected, because the system uses UTF-8, which can handle ASCII characters but not extended ASCII characters (source), to which � probably belongs:

[ziga@localhost ~]$ cat components.csv | head -n4
id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5�5,5V 400kHz SOIC8",92,0000,92,0000,92,0000,0,0,0,0,50,6000,0,0,0,0,57010274,GSE s.r.l.,0,0,0,NR,0,0

Now if I process this file directly with my program, the program ends its execution at this exact character, which is also expected:

[ziga@localhost ~]$ ./test.elf components.csv a 
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
1:
    Commencing import procedure of file "components.csv" into SQLite database "a".
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
2:
    CSV file "components.csv" found.
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
3:
    Printing inputed file "components.csv":

id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5

Next, I convert the file's encoding and create a new file, components-utf8.csv, in UTF-8. I tried this procedure for every ISO-8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15} encoding, and the command below yields the best results:

iconv -f ISO-8859-1 -t UTF-8 components.csv > components-utf8.csv

If I print the new file using cat and head, the unknown character now renders fine as ÷:

[ziga@localhost ~]$ cat components-utf8.csv | head -n4
id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5÷5,5V 400kHz SOIC8",92,0000,92,0000,92,0000,0,0,0,0,50,6000,0,0,0,0,57010274,GSE s.r.l.,0,0,0,NR,0,0

If I process the new file with my program, it runs from start to finish (here I paste only the first few lines), but it renders ÷ as ?:

[ziga@localhost ~]$ ./test.elf components-utf8.csv a 
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
1:
    Commencing import procedure of file "components-utf8.csv" into SQLite database "a".
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
2:
    CSV file "components-utf8.csv" found.
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
3:
    Printing inputed file "components-utf8.csv":

id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5?5,5V 400kHz SOIC8",92,0000,92,0000,92,0000,0,0,0,0,50,6000,0,0,0,0,57010274,GSE s.r.l.,0,0,0,NR,0,0

This is a mystery to me, especially because my program sets its internal encoding to imitate the system's encoding and I also use the wide printing functions. Here is the source code of the program:

// Headers:
#include <locale.h>
#include <wchar.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// Function prototypes:
void ruler(void);

// Function definitions:
void ruler(void){
    char* r1 = getenv("COLUMNS");
    int r2;
    if(r1 == NULL){
        r2 = 100;
    }
    else{
        r2 = strtol(r1, NULL, 10);
    }
    int i;
    for(i = 0; i < r2; i++){
        putwchar(L'―');
    }
    putwchar(L'\n');
}

// Entry point:
int main(int argc, char** argv){

    // Setting the user-preferred locale.
    setlocale(LC_ALL, "en_US.UTF-8");

    ruler();

    // Check if exactly two arguments are passed to the binary
    if(argc != 3){
        wprintf(L"USAGE:\n\t%s <CSV file in UTF-8 encoding> <database>\n\nHINT:\n\tUse terminal application \"file\" to guess CSV file's encoding and \"iconv\" to transcode it to UTF-8\n", argv[0]);
        ruler();
        return 1;
    }
    else{
        wprintf(L"1:\n\tCommencing import procedure of file \"%s\" into SQLite database \"%s\".\n", argv[1], argv[2]);
        ruler();
    }

    // Open CSV file
    FILE* csv_file = fopen(argv[1], "r");
    if(csv_file == NULL){
        wprintf(L"2:\n\tCSV file \"%s\" not found.\n", argv[1]);
        ruler();
        return 1;
    }
    else{
        wprintf(L"2:\n\tCSV file \"%s\" found.\n", argv[1]);
        ruler();
    }

    // Print CSV file
    wprintf(L"3:\n\tPrinting inputed file \"%s\":\n\n", argv[1]);
    char c = fgetwc(csv_file);
    while(c != WEOF){
        putwchar(c);
        c = fgetwc(csv_file);
    }
    putwchar(L'\n');

    return 0;

}
71GA

1 Answer

You need to use the correct type to represent wide characters - char is not sufficient for that.

char c = fgetwc(csv_file);

should be:

wint_t c = fgetwc(csv_file);

as per the fgetwc reference.

For other uses (i.e., when not dealing with a return value), there's wchar_t to represent wide characters.

Sander De Dycker
  • Thank you. Is using `wint_t` safe? Because I read that `wchar_t` is to be avoided since some compilers allocate it only 16 bits - not enough to represent Unicode. – 71GA Feb 20 '20 at 08:02
  • @71GA: Indeed, wide characters in C are not guaranteed to support all Unicode characters (or to use a specific encoding); it's [implementation defined](https://stackoverflow.com/questions/11287213/what-is-a-wide-character-string-in-c-language) (on purpose). Since your code uses `fgetwc` and `putwchar`, you chose to use (your implementation's version of) wide characters, so you need to use the corresponding types. If you're looking for more portable Unicode support, you'll probably have to roll your own or use an existing library. – Sander De Dycker Feb 20 '20 at 08:21
  • When `wchar_t` is a 16-bit type, you normally use the UTF-16 encoding, which can encode the rest of Unicode as surrogate pairs. This also works with UTF terminals, which normally use UTF-8. IMHO, you need to have a look at the Unicode standard and the different encodings it supports. – Luis Colorado Feb 21 '20 at 19:26