I have a Perl parser that reads a flat fixed-width input file according to a specification (the start, end, and length of each field are defined) and writes a comma-delimited file to be loaded into the database.
The input file can be ISO-LATIN-1 or UTF-8. Regardless of the charset, the parser creates the comma-delimited files fine (ISO-LATIN-1 in, ISO-LATIN-1 out; UTF-8 in, UTF-8 out).
Since ISO-LATIN-1 characters occupy only one byte, there's never a problem. But UTF-8 causes an issue after the data is loaded into the database. The parser reads the input by bytes, so if a field is 40 bytes long and a UTF-8 character occupies byte positions 39, 40, and 41, only the first two bytes of that character land in the field, and that truncated value is what gets loaded into the database.
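To make the issue concrete, this is roughly what the parser does today (the file name and the 0/40 offsets are placeholders, not the real spec):

    use strict;
    use warnings;

    # Slice fields by byte offsets, as the parser does now.
    open my $fh, '<:raw', 'input.dat' or die "open: $!";  # raw bytes, no decoding
    while (my $line = <$fh>) {
        my $field = substr($line, 0, 40);  # substr counts bytes on a raw handle,
                                           # so a UTF-8 char straddling the 40/41
                                           # boundary gets cut in half
        # ... $field may now end in a partial multibyte sequence
    }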
Is there a way for Perl to read this string and remove the bad bytes at the end?
For example, say there's a 6-byte field and the character sequence is Â8ÄÂ, whose byte sequence is C3 82 38 C3 84 C3 82 (7 bytes). When the parser takes the first 6 bytes of this data it appears to fetch Â8Ä, but the bytes it actually extracts are C3 82 38 C3 84 C3. There's a dangling C3 at the end: the first half of a two-byte character. Is there a way to strip off this kind of incomplete trailing sequence in Perl?
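I was wondering whether Encode's FB_QUIET check mode could do this; as I read the docs, it decodes as much as is valid and leaves the malformed remainder behind, but I'm not sure it's the right approach:

    use Encode qw(decode encode FB_QUIET);

    my $bytes = "\xC3\x82\x38\xC3\x84\xC3";  # the 6 extracted bytes from the example

    # FB_QUIET stops at the first malformed sequence, returns what it decoded,
    # and leaves the unprocessed remainder in $bytes
    my $chars = decode('UTF-8', $bytes, FB_QUIET);

    # $chars is now "\x{C2}8\x{C4}" (Â8Ä); $bytes holds the leftover "\xC3"
    my $clean = encode('UTF-8', $chars);     # bytes again, minus the dangling C3

One concern with this: since FB_QUIET stops at the first malformed sequence, bad bytes in the middle of a field would also drop everything after them, not just the tail.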