
I'm a BI developer working with Perl scripts as my ETL - I receive data over email, take the file, parse it, and push it into the DB. Most of the files are CSV, but occasionally I get an XLSX file.

I've been using Spreadsheet::XLSX to convert those, but I've noticed that the CSV output comes out with the wrong encoding (it needs to be UTF-8, because the data contains accents and foreign languages).

Here's the sub I'm using ($input_file is an Excel file), but I keep getting the data with the wrong characters.

What am I missing?

Thanks a lot all!

use Spreadsheet::XLSX;

sub convert_to_csv {
    my $input_file = $_[0];
    my ( $filename, $extension ) = split( '\.', $input_file );
    open( format_file, ">:encoding(UTF-8)", "$filename.csv" ) or die "could not open out file $!\n";
    my $excel = Spreadsheet::XLSX->new($input_file);
    my $line;
    foreach my $sheet ( @{ $excel->{Worksheet} } ) {

        #printf( "Sheet: %s\n", $sheet->{Name} );
        $sheet->{MaxRow} ||= $sheet->{MinRow};
        foreach my $row ( $sheet->{MinRow} .. $sheet->{MaxRow} ) {
            $sheet->{MaxCol} ||= $sheet->{MinCol};
            foreach my $col ( $sheet->{MinCol} .. $sheet->{MaxCol} ) {
                my $cell = $sheet->{Cells}[$row][$col];
                if ($cell) {
                    my $trimcell = $cell->value();
                    print STDERR "cell: $trimcell\n";  ## just for testing, so I don't have to open the file to check it
                    $trimcell =~ s/^\s+|\s+$//g;       ## make sure there are no extra spaces
                    $line .= "\"" . $trimcell . "\",";
                }
            }
            chomp($line);
            print format_file "$line\n" unless $line =~ /Grand Total/;    ## skip the Grand Total row (customized for these files)
            $line = '';
        }
    }
    close format_file;
}
  • Where are you seeing the wrong characters, in the DB? – CDP1802 Feb 24 '20 at 19:23
  • Do you have anything like the following at the beginning of your code? `use utf8; use open qw/ :std :encoding(UTF-8) /;` – Polar Bear Feb 24 '20 at 20:41
  • Yes... the problem arises somewhere in that snippet. I have `use utf8; binmode(STDOUT, ":utf8")` and it doesn't change a thing. – AALB Feb 25 '20 at 07:36
  • @CDP1802 nope, in the file itself. Excel has the right characters, CSV doesn't. – AALB Feb 25 '20 at 07:37
  • So what client are you using to inspect the CSV, a text editor? Or are you using Excel to view the CSV? You can add [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) characters to the beginning of the file, but it shouldn't really be necessary. – CDP1802 Feb 25 '20 at 07:51
  • Note that Excel will not open CSV files correctly by default, but the BOM may cause it to interpret it as UTF-8 on newer Excels. See https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically – Grinnz Feb 25 '20 at 21:05
  • I recommend using [Text::CSV](https://metacpan.org/pod/Text::CSV) to create the CSV. (You can apply the same encoding layer to the handle, but it will manage quoting as needed; see the sketch after this comment thread.) – Grinnz Feb 25 '20 at 21:07
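
A minimal sketch of the Text::CSV route Grinnz suggests (the sub name write_rows_as_csv and the row-gathering interface are my own illustration, not from the thread; the optional BOM line is the trick CDP1802 and Grinnz mention for making newer Excels detect UTF-8):

use strict;
use warnings;
use Text::CSV;

sub write_rows_as_csv {
    my ( $rows, $out_file ) = @_;    # $rows is an arrayref of arrayrefs (one per row)
    my $csv = Text::CSV->new( { binary => 1, eol => "\n" } )
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag();
    open( my $fh, '>:encoding(UTF-8)', $out_file )
        or die "could not open $out_file: $!\n";
    print $fh "\x{FEFF}";                 # optional BOM so Excel recognizes UTF-8
    $csv->print( $fh, $_ ) for @$rows;    # Text::CSV handles quoting and escaping
    close $fh;
}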

1 Answer


My experience is with ETL::Pipeline, which uses Spreadsheet::XLSX to read .xlsx files; in my case, though, I know which fields are UTF-8.

I wrote a local ETL::Pipeline module to handle the output for Excel files:

use Encode qw(decode encode);

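# decode the raw UTF-8 bytes into Perl characters; Encode::FB_CROAK dies on malformed input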
$ra_rec->{name} = decode( 'UTF-8', $ra_rec->{name}, Encode::FB_CROAK );
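
Applied to the loop in the question, the same decode would look something like this (a sketch, assuming the wrong characters come from Spreadsheet::XLSX returning UTF-8 encoded bytes, which the :encoding(UTF-8) layer on the output handle then encodes a second time):

use Encode qw(decode);

# decode bytes to characters once, and let the output layer do the only encode
my $trimcell = decode( 'UTF-8', $cell->value(), Encode::FB_CROAK );
$trimcell =~ s/^\s+|\s+$//g;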
  • To make the sentence "I wrote a local ETL::Pipeline module to handle the output for Excel files" clearer: I wrote a local ETL::Pipeline module to handle UTF-8 before inserting the line/record into the database. – guest Mar 02 '20 at 20:25