
I have a Python program that parses text files line by line. A few of these lines are corrupt, meaning they contain non-UTF-8 characters. Once a line has a corrupt character, the whole content of the line is waste, so solutions that delete or replace single characters won't do. Priority number one is to delete any line containing non-UTF-8 characters; if possible, I'd also like to save those lines to another file so I can inspect them further. All the solutions I have found so far only delete or replace individual non-UTF-8 characters.

My main language is Python, but I am working on Linux, so Bash etc. is also a viable solution.

bjornasm

1 Answer


My main language is Python, but I am working on Linux, so Bash etc. is also a viable solution.

I don't know Python well enough to use it for an answer, so here's a Perl version. The logic should be pretty similar:

#!/usr/bin/env perl
use warnings;
use strict;
use Encode;

# One argument: filename to log corrupt lines to. Reads from standard
# input, prints valid lines on standard output; redirect to another
# file if desired.

# Treat input and outputs as binary streams, except STDOUT is marked
# as UTF8 encoded.
open my $errors, ">:raw", $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
binmode STDIN, ":raw";
binmode STDOUT, ":raw:utf8";

# For each line read from standard input, print it to standard
# output if valid UTF-8, otherwise log it.
while (my $line = <STDIN>) {
    eval {
        # Default decode behavior is to replace invalid sequences with U+FFFD.
        # Raise an error instead.
        print decode("UTF-8", $line, Encode::FB_CROAK);
    } or print $errors $line;
}

close $errors;
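
Since the question's main language is Python, here is a rough, untested sketch of the same logic in Python; it mirrors the Perl version's convention of reading from standard input, writing valid lines to standard output, and taking the log filename as the first argument:

#!/usr/bin/env python3
# Minimal sketch: read binary lines from stdin, write valid UTF-8 lines
# to stdout, and log corrupt lines to the file named by the first argument.
import sys

with open(sys.argv[1], "wb") as errors:
    for line in sys.stdin.buffer:          # iterate raw bytes, no decoding
        try:
            line.decode("utf-8")           # raises UnicodeDecodeError on bad bytes
        except UnicodeDecodeError:
            errors.write(line)             # save the corrupt line for inspection
        else:
            sys.stdout.buffer.write(line)  # pass the valid line through untouched

Either version can be run with something like `script corrupt.log < input.txt > clean.txt` (the filenames here are just placeholders).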
Shawn