
I have a large set of nested directories containing PHP, HTML, and JavaScript files that should all be encoded as UTF-8. However, someone edited several of the files and saved them with ISO-8859-1 encoding. Unfortunately, they're all mixed in with the UTF-8 files.

I'd like to use the iconv tool to convert the incorrectly-encoded files to UTF-8 (as described here). Primarily, the problems occur with characters that are valid ISO-8859-1 but invalid UTF-8.
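For reference, converting a single file whose encoding is known is straightforward; the file names below are just placeholders:

```bash
# Hypothetical file names: convert one file known to be ISO-8859-1 into UTF-8
iconv -f ISO-8859-1 -t UTF-8 broken.php > fixed.php
```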

I think an appropriate starting point would be to find all files that contain invalid UTF-8. What's a good way to do this?

I realise this won't catch all of the cases where the wrong character might be displayed. Any further tips on how I might fix this mess?

Jonathan
  • Consider having a look at [Windows-1252 to UTF-8 encoding](https://stackoverflow.com/a/75810471). – Henke Mar 22 '23 at 09:56

1 Answer


This would be a bit of a hack, but since it's a one-off job it might be worth it. iconv will complain about an invalid encoding if it can't read a file using the encoding you give it. You could therefore write a wrapper script that iterates over all the files and attempts to convert each one from UTF-8 to something else; any file that fails to convert contains invalid UTF-8. A sketch of such a script is below.
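A minimal sketch of that wrapper, assuming GNU find and iconv are available and that the file extensions shown match yours:

```bash
#!/bin/bash
# List files whose contents are not valid UTF-8.
# The extensions and starting directory are illustrative; adjust as needed.
find . -type f \( -name '*.php' -o -name '*.html' -o -name '*.js' \) -print0 |
while IFS= read -r -d '' f; do
    # iconv exits non-zero if the input cannot be read as UTF-8
    if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
        printf 'Invalid UTF-8: %s\n' "$f"
    fi
done
```

The loop keys off iconv's exit status rather than its output, so nothing needs to be parsed.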

chooban
  • Cool! That's pretty much what I did: `iconv -f UTF-8 -t UTF-8 | grep "^iconv"` handled it pretty well. – Jonathan Oct 05 '12 at 13:32