95

I have a bunch of Arabic, English, and Russian files which are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:

Malformed UTF-8 character (fatal)

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do it?

Hakim
  • 2
    Maybe it's the same as this: http://stackoverflow.com/questions/7656283/malformed-utf-8-character-fatal-error-while-parsing-xml-using-xmllibxml – Olaf Dietsche Oct 21 '12 at 16:29
  • 2
    Please refer to this link: http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8 – askmish Oct 21 '12 at 16:50
  • 4
    What are non UTF-8 characters? All characters in a well formed UTF-8 string are UTF-8 (actually Unicode) characters! Some of them are UTF-8 encoded in several consecutive bytes.... – Basile Starynkevitch Oct 21 '12 at 16:58
  • 3
    @BasileStarynkevitch: the error message clearly states that there is a malformed UTF-8 character. That means that a byte appeared that cannot appear as part of a valid UTF-8 file. That's not hard; it could be a 0xC0 or 0xC1 byte, or 0xF5..0xFF, or a sequencing problem with bytes that would otherwise be valid. – Jonathan Leffler Dec 08 '12 at 04:53
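
To see where the offending bytes are before removing anything, here is a minimal Python sketch. It is not from the thread, and `find_invalid_utf8` is a hypothetical helper name. It relies on the `start` and `end` offsets that Python's `UnicodeDecodeError` reports, and it re-slices the buffer after each error, so it is meant for inspection rather than speed.

```python
def find_invalid_utf8(data: bytes) -> list:
    """Return (offset, offending_bytes) for each invalid UTF-8 sequence in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice we decoded
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume right after the bad sequence
    return problems
```

Calling it on the raw bytes of a file (read in binary mode) gives the byte offsets to inspect by hand.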

4 Answers

181

This command:

iconv -f utf-8 -t utf-8 -c file.txt

will write a cleaned-up copy of your UTF-8 file to standard output, skipping all the invalid sequences. The original file is not modified.

-f is the source format
-t the target format
-c skips any invalid sequence
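
If iconv is not at hand, roughly the same effect can be sketched in Python by decoding with the `ignore` error handler and re-encoding. This is an illustration under assumptions, not the answer's own method: `strip_invalid_utf8` and `clean_file` are hypothetical names, the whole file is read into memory, and edge-case behaviour may differ from `iconv -c`.

```python
from pathlib import Path


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop every byte sequence that is not valid UTF-8; valid text is kept as-is."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def clean_file(src: str, dst: str) -> None:
    """Write a cleaned copy of src to dst; src itself is left untouched."""
    Path(dst).write_bytes(strip_invalid_utf8(Path(src).read_bytes()))
```

Writing to a separate destination file keeps the original as a backup in case the result is not what you expected.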
wberry
Palantir
  • 11
    "iconv -f utf-8 -t utf-8 -c file.txt" on a Mac. hyphen between 'f' and '8' – Colin Nov 20 '13 at 18:14
  • Correct, hyphens are required. Thanks for the edit. You can get the list of supported encodings via iconv --list – Palantir Feb 14 '14 at 08:57
  • 1
    Conveniently, you can clean the clipboard contents on a Mac like so: `pbpaste | iconv -f utf-8 -t utf-8 -c | pbcopy`. I also created an Alfred workflow with a global shortcut for stripping all special characters by targeting `ascii`. – Lenar Hoyt Jul 16 '14 at 12:59
  • 1
    This produced a file that was completely blank for me. Just want to let everyone know this is potentially destructive and to back up their file before running this on it. – counterbeing Sep 05 '14 at 17:35
  • 8
    `iconv -f utf-8 -t ascii//TRANSLIT` solved my problem. It converts curly quotes to straight quotes. – Colonel Panic Mar 30 '16 at 15:27
  • 6
    `-o` for different output file – codaamok Aug 17 '16 at 12:44
  • What do you do when it can't convert the characters? `iconv: something.csv:440938:30: cannot conver` – timbram Jun 14 '18 at 18:52
  • Adding a small caveat to this great answer: iconv will try to load your whole file into memory before processing it, which might be unsuitable for streams of data or large text files – dvhh Mar 08 '19 at 06:23
  • You can redirect input to output, i.e. `iconv -f utf-8 -t ascii//TRANSLIT -c thumbnail.sub -o thumbnail.sub` works to overwrite the file (test first without the `-o`!). For those who know `sed -i`: the `-o` for iconv is somewhat similar to sed's `-i` (in-place edit), with the difference that you have to specify the same file name again. – Roel Van de Paar Jun 04 '19 at 21:15
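
For large files or streams, where the memory concern raised in the comments matters, a chunked variant can be sketched in Python with an incremental decoder. This is an assumption-laden illustration rather than anything proposed in the thread: `clean_stream` is a hypothetical name, and both arguments are expected to be binary file-like objects.

```python
import codecs


def clean_stream(src, dst, chunk_size=64 * 1024):
    """Copy binary stream src to dst, dropping invalid UTF-8, one chunk at a time."""
    # The incremental decoder buffers a multi-byte character that is split
    # across two chunks, so valid text is never cut at a chunk boundary.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Final flush: an incomplete sequence left at end of input is dropped.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Open both files in binary mode (`"rb"` and `"wb"`) and write to a different path than the input, so the source is not truncated while it is still being read.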
0

iconv can do it

iconv -f cp1252 foo.txt
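
This only helps if the file is actually Windows-1252 rather than UTF-8, which the question does not confirm. Under that assumption, the same conversion can be sketched in Python (`cp1252_to_utf8` is a hypothetical name). The few byte values that Python's `cp1252` codec leaves undefined are turned into U+FFFD by the `replace` handler instead of raising an error.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Reinterpret bytes as Windows-1252 text and re-encode them as UTF-8."""
    # errors="replace" maps undefined cp1252 bytes (e.g. 0x81) to U+FFFD.
    return data.decode("cp1252", errors="replace").encode("utf-8")
```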
Zombo
0

Your method must read the file byte by byte and fully understand the byte-wise construction of UTF-8 characters. The simplest approach is to use an editor which will read anything but only output valid UTF-8. TextPad is one choice.

Charles Knell
  • iconv is not available in Cygwin. Is there any way to do this on Windows/Cygwin? I have a big (100000+ lines) XML file that needs stripping of invalid characters. I don't care about valid UTF-8. I've set Notepad++ to UTF-8, but even after saving it from there I still get errors in the XML parser – mljm Apr 07 '17 at 21:07
  • Ubuntu under WSL on Windows comes with iconv – Kat Lim Ruiz Nov 04 '20 at 02:11
0

None of the methods here or on any other similar questions worked for me. In the end what worked was simply opening the file in Sublime Text 2. Go to File > Reopen with Encoding > UTF-8. Copy the entire content of the file into a new file and save it.

This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.

Mythos