I have a C program, test.elf, that should read a UTF-8 encoded file and print it in a terminal on a UTF-8 system. Someone gave me a file, components.csv, which is ISO-8859-1 encoded, and I ran into problems.
My system's encoding can be checked in a terminal, and it is indeed UTF-8:
[ziga@localhost ~]$ echo $LANG
en_US.UTF-8
I can also check, or rather guess (!), the file's encoding, which is one of ISO-8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15} (source):
[ziga@localhost ~]$ file components.csv
components.csv: ISO-8859 text, with very long lines, with CRLF line terminators
If I read this file directly using cat and limit the output to the first few lines using head, I see the first unknown character, �. This is expected: the terminal decodes its input as UTF-8, in which ASCII bytes are valid as they are, but a single byte from the extended ASCII range (0x80 to 0xFF) (source) is not a valid sequence on its own, and that is probably where � comes from:
[ziga@localhost ~]$ cat components.csv | head -n4
id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5�5,5V 400kHz SOIC8",92,0000,92,0000,92,0000,0,0,0,0,50,6000,0,0,0,0,57010274,GSE s.r.l.,0,0,0,NR,0,0
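To make the reasoning above concrete, here is a minimal sketch of a UTF-8 well-formedness check. The function name is_valid_utf8 is hypothetical and not part of test.elf, and the byte value 0xF7 used in the usage note is an assumption: it is the ISO-8859-1 code for ÷, which is what the character turns out to be after conversion.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the n bytes at s are well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t n) {
    /* Smallest code point that may legally use a sequence of each length. */
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long cp;
        if (b < 0x80) {                  /* plain ASCII byte */
            len = 1; cp = b;
        } else if ((b & 0xE0) == 0xC0) { /* lead byte of a 2-byte sequence */
            len = 2; cp = b & 0x1F;
        } else if ((b & 0xF0) == 0xE0) { /* lead byte of a 3-byte sequence */
            len = 3; cp = b & 0x0F;
        } else if ((b & 0xF8) == 0xF0) { /* lead byte of a 4-byte sequence */
            len = 4; cp = b & 0x07;
        } else {
            return false;                /* stray continuation or invalid lead byte */
        }
        if (i + len > n) {
            return false;                /* sequence is cut off */
        }
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return false;            /* expected a continuation byte */
            }
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min_cp[len]) return false;               /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return false;                  /* beyond Unicode */
        i += len;
    }
    return true;
}
```

Under that assumption, the bytes of "2,5" followed by 0xF7 and "5,5V" fail the check, because 0xF7 looks like the lead byte of a 4-byte sequence but is followed by the ASCII digit 5 instead of a continuation byte. The same text with ÷ encoded as 0xC3 0xB7 passes.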
Now, if I process this file directly with my program, the program ends its execution at this exact character, which is also expected:
[ziga@localhost ~]$ ./test.elf components.csv a
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
1:
Commencing import procedure of file "components.csv" into SQLite database "a".
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
2:
CSV file "components.csv" found.
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
3:
Printing inputed file "components.csv":
id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5
Next, I converted the file and created a new file, components-utf8.csv, in UTF-8 encoding. I tried this procedure with every one of the ISO-8859-{1,2,3,4,5,6,7,8,9,10,11,13,14,15} encodings as the source, and the command below yields the best results:
iconv -f ISO-8859-1 -t UTF-8 components.csv > components-utf8.csv
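For this particular pair of encodings the conversion is simple enough to sketch by hand. The function below is only an illustration of the ISO-8859-1 to UTF-8 mapping, not a description of how iconv is implemented. The name latin1_to_utf8 is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: converts n ISO-8859-1 bytes from in to UTF-8 in out.
 * The caller must provide room for 2 * n bytes in out.
 * Returns the number of bytes written.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            /* ASCII is identical in both encodings. */
            out[o++] = b;
        } else {
            /* Code points U+0080..U+00FF take two bytes in UTF-8. */
            out[o++] = (unsigned char)(0xC0 | (b >> 6));
            out[o++] = (unsigned char)(0x80 | (b & 0x3F));
        }
    }
    return o;
}
```

Assuming the problematic byte in components.csv is 0xF7 (÷ in ISO-8859-1), it becomes the two bytes 0xC3 0xB7, which is the UTF-8 encoding of U+00F7.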
If I print the new file using cat and head, the previously unknown character now renders fine as ÷:
[ziga@localhost ~]$ cat components-utf8.csv | head -n4
id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5÷5,5V 400kHz SOIC8",92,0000,92,0000,92,0000,0,0,0,0,50,6000,0,0,0,0,57010274,GSE s.r.l.,0,0,0,NR,0,0
If I process the new file with my program, it runs from start to finish (here I paste only the first few lines), but it renders ÷ as ?:
[ziga@localhost ~]$ ./test.elf components-utf8.csv a
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
1:
Commencing import procedure of file "components-utf8.csv" into SQLite database "a".
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
2:
CSV file "components-utf8.csv" found.
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
3:
Printing inputed file "components-utf8.csv":
id_articolo,codice,descrizione,esistenza,disponibilita,qta_rim_iniziale,qta_caricata,qta_scaricata,qta_ord_clienti,qta_ord_fornitori,val_rim_iniziale,val_caricato,val_scaricato,ultimo_costo,c_scorta_min,c_cod_fornitore,c_des_fornitore,c_prd_qta_avanz,c_prd_qta_wip,prezzo_listino,codice,qta_altri_carichi,qta_altri_scarichi
41,15MQ040N,Diodo schottky 3A 40V SMA,6755,0000,6755,0000,6755,0000,0,0,0,0,0,0,0,0,0,,,0,0,0,NR,0,0
49,24LC256-I/SN,Memoria flash 8 pin SOIC-8 256kbit,22,0000,22,0000,22,0000,0,0,0,0,16,0600,0,0,0,0,57010035,EBV Elektronik,0,0,0,NR,0,0
2156,24LC512-I/SN,"Memoria EEPROM I2C 64kx8bit 2,5?5,5V 400kHz SOIC8",92,0000,92,0000,92,0000,0,0,0,0,50,6000,0,0,0,0,57010274,GSE s.r.l.,0,0,0,NR,0,0
This is a mystery to me, especially because my program sets its internal encoding to imitate the system's encoding and I also use wide printing functions. Here is the source code of the program:
// Headers:
#include <locale.h>
#include <wchar.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// Function prototypes:
void ruler(void);

// Function definitions:
void ruler(void){
    char* r1 = getenv("COLUMNS");
    int r2;
    if(r1 == NULL){
        r2 = 100;
    }
    else{
        r2 = strtol(r1, NULL, 10);
    }
    int i;
    for(i = 0; i < r2; i++){
        putwchar(L'―');
    }
    putwchar(L'\n');
}

// Entry point:
int main(int argc, char** argv){
    // Setting the user-preferred locale.
    setlocale(LC_ALL, "en_US.UTF-8");
    ruler();
    // Check if exactly two arguments are passed to the binary
    if(argc != 3){
        wprintf(L"USAGE:\n\t%s <CSV file in UTF-8 encoding> <database>\n\nHINT:\n\tUse terminal application \"file\" to guess CSV file's encoding and \"iconv\" to transcode it to UTF-8\n", argv[0]);
        ruler();
        return 1;
    }
    else{
        wprintf(L"1:\n\tCommencing import procedure of file \"%s\" into SQLite database \"%s\".\n", argv[1], argv[2]);
        ruler();
    }
    // Open CSV file
    FILE* csv_file = fopen(argv[1], "r");
    if(csv_file == NULL){
        wprintf(L"2:\n\tCSV file \"%s\" not found.\n", argv[1]);
        ruler();
        return 1;
    }
    else{
        wprintf(L"2:\n\tCSV file \"%s\" found.\n", argv[1]);
        ruler();
    }
    // Print CSV file
    wprintf(L"3:\n\tPrinting inputed file \"%s\":\n\n", argv[1]);
    char c = fgetwc(csv_file);
    while(c != WEOF){
        putwchar(c);
        c = fgetwc(csv_file);
    }
    putwchar(L'\n');
    return 0;
}
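One detail in the listing that may be related: fgetwc returns a wint_t, but the result is stored in a plain char, so any character outside the range of char cannot be held without loss. The sketch below is not a confirmed diagnosis. It only illustrates the narrowing and shows a read loop that keeps the full wint_t value. The names survives_char_storage and copy_wide are hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 if a value returned by fgetwc() is
 * unchanged after being stored in a plain char and read back, else 0.
 */
int survives_char_storage(wint_t wc) {
    char narrow = (char)wc;   /* same narrowing as "char c = fgetwc(...)" */
    return (wint_t)narrow == wc;
}

/*
 * Hypothetical helper: copies wide characters from in to out, keeping
 * each one in a wint_t so that neither the character nor WEOF is
 * truncated. Returns the number of characters copied.
 */
long copy_wide(FILE *in, FILE *out) {
    long count = 0;
    wint_t c;
    while ((c = fgetwc(in)) != WEOF) {
        if (fputwc((wchar_t)c, out) == WEOF) {
            break;            /* output conversion or write error */
        }
        count++;
    }
    return count;
}
```

For example, the ruler character U+2015 does not survive the narrowing, while an ASCII letter does. Whether U+00F7 survives depends on whether char is signed on the platform, so that case is left out here.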