I have a Perl parser that reads a flat fixed-width input file based on a specification (start, end and length defined for each field) and creates a comma-delimited file to be loaded into the database.

The input file can be ISO-LATIN-1 or UTF-8. Regardless of the charset, Perl does a good job of creating the comma-delimited files (ISO-LATIN-1 to ISO-LATIN-1 and UTF-8 to UTF-8).

Since ISO-LATIN-1 characters occupy only one byte each, there's never a problem there. UTF-8, however, causes an issue after the data is loaded into the database. Because the Perl parser goes by bytes when reading the input data, if a field is 40 bytes long but a UTF-8 character occupies positions 39, 40 and 41, then only the first 2 bytes of that character are extracted into the field, and that is what gets loaded into the database.

Is there a way for Perl to read this string and remove the bad bytes at the end of it?

For example: say there's a 6-byte field and the character sequence is Â8ÄÂ, where the byte sequence is c382 38 c384 c382 (which is 7 bytes). When the Perl parser parses this data, it appears to fetch Â8Ä, but looking at the byte values it's extracting c382 38 c384 c3. There's a half byte c3 at the end. Is there a way to strip off this kind of bad byte using Perl?

SNL
  • Terminology note: `c3` is a full byte, not a half byte. – chepner Aug 02 '12 at 18:59
  • sorry, i meant a dangling byte. – SNL Aug 02 '12 at 19:53
  • possible duplicate of [perl- trim utf8 bytes to 'length' and sanitize the data](http://stackoverflow.com/questions/10953069/perl-trim-utf8-bytes-to-length-and-sanitize-the-data) – Oleg V. Volkov Aug 02 '12 at 19:57
  • I think you are doing something wrong here and ask us to help doctor around the symptoms instead of just fixing the code so it always decodes properly without losing characters at the end. Show your code. – daxim Aug 03 '12 at 00:41
  • If you always want to read 40 characters, not bytes, you need to tell Perl to `use unicode;` for the Unicode case. Do you know ahead of time which inputs use which character encoding? – tripleee Aug 03 '12 at 16:25

1 Answer

See this:

The 'U' template format of the Perl pack function on this page: http://www.misc-perl-info.com/perl-pack.html

this:

http://ahinea.com/en/tech/perl-unicode-struggle.html

and this:

Perl: utf8::decode vs. Encode::decode
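
As a rough illustration of the decoding approach those links discuss, here is a minimal sketch that drops a dangling partial UTF-8 sequence from the end of a byte-sliced field. It assumes the core `Encode` module is available; `trim_partial_utf8` is a hypothetical helper name, not something from the question's parser. It relies on `decode` with `FB_QUIET`, which returns the portion decoded so far when it hits a malformed sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_QUIET);

# Hypothetical helper: remove an incomplete UTF-8 sequence left at the end
# of a byte string that was cut at a fixed byte offset.
sub trim_partial_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() modifies its argument

    # With FB_QUIET, decode() stops at the first malformed sequence and
    # returns only the characters decoded up to that point.
    my $chars = decode('UTF-8', $bytes, FB_QUIET);

    # Re-encode so the result is a byte string again.
    return encode('UTF-8', $chars);
}

# The 6-byte slice from the question: c382 38 c384 c3
my $field = "\xC3\x82\x38\xC3\x84\xC3";
my $clean = trim_partial_utf8($field);

print unpack('H*', $clean), "\n";    # prints c38238c384
```

Note that this stops at the first malformed sequence anywhere in the field, not only at the end, so a bad byte in the middle would also cut off everything after it. If the input encoding is known ahead of time, decoding the whole record before slicing it by characters avoids the dangling byte in the first place.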

Patt Mehta