
I have a Python program that parses text files line by line. A few of these lines are corrupt, meaning they contain non-UTF-8 characters. Once a line has a corrupt character, the whole content of the line is waste, so solutions that delete or replace single characters won't do. Priority number one is to delete any line containing non-UTF-8 characters; if possible, I'd also like to save those lines to another file so I can inspect them further. All the solutions I have found so far only delete or replace individual non-UTF-8 characters.

My main language is Python, but I am working on Linux, so Bash etc. is also a viable solution.

bjornasm

1 Answer


My main language is Python, but I am working on Linux, so Bash etc. is also a viable solution.

I don't know Python well enough to use it for an answer, so here's a Perl version. The logic should be pretty similar:

#!/usr/bin/env perl
use warnings;
use strict;
use Encode;

# One argument: filename to log corrupt lines to. Reads from standard
# input, prints valid lines on standard output; redirect to another
# file if desired.

# Treat input and outputs as binary streams, except STDOUT is marked
# as UTF8 encoded.
open my $errors, ">:raw", $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
binmode STDIN, ":raw";
binmode STDOUT, ":raw:utf8";

# For each line read from standard input, print it to standard
# output if valid UTF-8, otherwise log it.
while (my $line = <STDIN>) {
    eval {
        # Default decode behavior is to replace invalid sequences with U+FFFD.
        # Raise an error instead.
        print decode("UTF-8", $line, Encode::FB_CROAK);
    } or print $errors $line;
}

close $errors;
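
Since the question's main language is Python, here is a rough, untested sketch of the same logic in Python; it mirrors the Perl version's convention of reading from standard input, writing valid lines to standard output, and taking the log filename as the first argument:

#!/usr/bin/env python3
# Minimal sketch: read binary lines from stdin, write valid UTF-8 lines
# to stdout, and log corrupt lines to the file named by the first argument.
import sys

with open(sys.argv[1], "wb") as errors:
    for line in sys.stdin.buffer:          # iterate raw bytes, no decoding
        try:
            line.decode("utf-8")           # raises UnicodeDecodeError on bad bytes
        except UnicodeDecodeError:
            errors.write(line)             # save the corrupt line for inspection
        else:
            sys.stdout.buffer.write(line)  # pass the valid line through untouched

Either version can be run with something like `script corrupt.log < input.txt > clean.txt` (the filenames here are just placeholders).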
Shawn