
I'm processing some data files that are supposed to be valid UTF-8 but aren't, which causes the parser (not under my control) to fail. I'd like to add a stage of pre-validating the data for UTF-8 well-formedness, but I've not yet found a utility to help do this.

There's a web service at W3C which appears to be dead, and I've found a Windows-only validation tool that reports invalid UTF-8 files but doesn't report which lines/characters to fix.

I'd be happy with either a tool I can drop in and use (ideally cross-platform), or a ruby/perl script I can make part of my data loading process.
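A minimal pre-validation stage of the kind asked for could look like the Python sketch below (the function name and return shape are illustrative assumptions, not an existing utility; the question mentions Ruby/Perl, but the approach is the same):

```python
# Sketch of a pre-validation step: report where the first invalid
# UTF-8 sequence occurs, so the offending line/character can be fixed.
def find_invalid_utf8(path):
    """Return (line, column, byte_offset) for the first invalid byte
    sequence (line and column are 1-based), or None if the file is
    well-formed UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        prefix = data[:exc.start]                 # everything before the bad byte
        line = prefix.count(b"\n") + 1            # 1-based line number
        column = exc.start - prefix.rfind(b"\n")  # 1-based column on that line
        return (line, column, exc.start)
    return None
```

For a clean file this returns None; for a corrupt one it returns the position of the first bad byte, which is exactly the information a fix-up pass needs.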

Csa77
Ian Dickinson

6 Answers


You can use GNU iconv:

$ iconv -f UTF-8 your_file -o /dev/null; echo $?

Or with older versions of iconv, such as on macOS:

$ iconv -f UTF-8 your_file > /dev/null; echo $?

The command will return 0 if the file could be converted successfully, and 1 if not. Additionally, it will print out the byte offset where the invalid byte sequence occurred.

Edit: The output encoding doesn't have to be specified; it will be assumed to be UTF-8.

Richard Gomes
Torsten Marek
    In older versions of iconv, like that on OSX or in fink, there is no -o flag. Redirecting stdout should always work, however. – Joe Hildebrand Sep 22 '08 at 15:07
  • Torsten, thanks this works perfectly on my linux machine. I couldn't find a version of iconv utility for cygwin, but that's not a showstopper. – Ian Dickinson Sep 22 '08 at 16:16
  • You can also use the tool to sanitize the files instead, if you don't mind losing some info by specifying: UTF-8//TRANSLIT, UTF-8//IGNORE or even UTF-8//TRANSLIT//IGNORE. – webmat Mar 23 '12 at 13:32
  • @IanDickinson: `iconv` is part of Cygwin. As of April 2017, it's in package "libiconv". – sleske Apr 18 '17 at 11:13
  • Better yet, redirect stdout and stderr to /dev/null: `iconv -f UTF-8 your_file > /dev/null 2>&1; echo $?` – Stratus3D Jul 31 '17 at 18:52
  • "The output encoding [...] will be assumed to be UTF-8" contradicts the documentation, which says that the encoding "defaults to the encoding of the current locale" (GNU man page). If any character in the input is not supported by that encoding, a "cannot convert" or "illegal input sequence" error will be emitted, even if the input is valid UTF-8. Use `iconv -f UTF-8 -t UTF-8 your_file > /dev/null` to avoid these false positives. – MvanGeest Jun 06 '19 at 18:39

You can use isutf8 from the moreutils collection.

$ apt-get install moreutils
$ isutf8 your_file

In a shell script, use the --quiet switch and check the exit status, which is zero for files that are valid UTF-8.

Roger Dahl

Use Python and the str encode/decode functions (Python 2 shown here).

>>> a="γεια"
>>> a
'\xce\xb3\xce\xb5\xce\xb9\xce\xb1'
>>> b='\xce\xb3\xce\xb5\xce\xb9\xff\xb1' # note second-to-last char changed
>>> print b.decode("utf_8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 6: unexpected code byte

The exception thrown has the info requested in its .args property.

>>> try: print b.decode("utf_8")
... except UnicodeDecodeError, exc: pass
...
>>> exc
UnicodeDecodeError('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
>>> exc.args
('utf8', '\xce\xb3\xce\xb5\xce\xb9\xff\xb1', 6, 7, 'unexpected code byte')
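For reference, the same check in Python 3, where the sample must be written as a bytes literal and the `except` syntax differs (a sketch of the equivalent, not part of the original answer):

```python
# Python 3 version of the check above: bytes literal and
# "except ... as ..." syntax replace the Python 2 forms.
b = b'\xce\xb3\xce\xb5\xce\xb9\xff\xb1'  # same bytes, second-to-last char corrupted
try:
    b.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start/exc.end give the offending byte range
    print(exc.start, exc.end, exc.reason)  # prints: 6 7 invalid start byte
```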
tzot

How about the GNU iconv library? Using the iconv() function: "An invalid multibyte sequence is encountered in the input. In this case it sets errno to EILSEQ and returns (size_t)(-1). *inbuf is left pointing to the beginning of the invalid multibyte sequence."

EDIT: Oh - I missed the part where you want a scripting language. But for command-line work, the iconv utility should validate for you too.

AShelly

Here is a bash script to check whether a file is valid UTF-8 or not:

#!/bin/bash

inputFile="./testFile.txt"

if iconv -f UTF-8 "$inputFile" -o /dev/null; then
    echo "Valid UTF-8 file."
else
    echo "Invalid UTF-8 file!"
fi

Description:

  • --from-code, -f encoding (Convert characters from encoding)
  • --to-code, -t encoding (Convert characters to encoding; if omitted, the encoding of the current locale is assumed)
  • --output, -o file (Specify output file 'instead of stdout')
Sherzad
  • Downvoted because this adds nothing to the [accepted answer](https://stackoverflow.com/a/115262/735926). – bfontaine Jul 10 '23 at 08:05

You can also use recode, which will exit with an error if it tries to decode UTF-8 and encounters invalid characters.

if recode utf8/..UCS < "$FILE" >/dev/null 2>&1; then
    echo "Valid utf8 : $FILE"
else
    echo "NOT valid utf8: $FILE"
fi

This tries to recode to the Universal Character Set (UCS) which is always possible from valid UTF-8.

mivk
  • does this change the data? i would like to stream my data through recode, fail if it's bad and if it's good i'd like the data to stay in utf8 – Binyamin Sep 30 '20 at 06:58
  • @Binyamin: Yes, it does recode. In the example above, it just sends the recoded data to /dev/null. But you could also pipe to a second recode which puts it back in utf8: `recode utf8/..UCS < "$FILE" | recode UCS/..utf8`. That would abort the output when encountering invalid data. – mivk Sep 30 '20 at 09:11
  • thanks, i thought i might end up needing to recode twice – Binyamin Sep 30 '20 at 10:27