1. The files that are already in UTF-8 should not be changed 1
When I recently had this issue, I solved it by first finding all
files in need of conversion.
I did this by excluding the files that should not be converted:
binary files, pure ASCII files (which by definition already have a
valid UTF-8 encoding), and files that already contain valid
non-ASCII UTF-8 characters.
In short, I recursively searched for the files that probably should
be converted:
$ find . -type f -exec sh -c 'for n; do file -i "$n" | grep -Ev "binary|us-ascii|utf-8"; done' sh {} +
I had a subdirectory tree containing some 300 to 400 files.
About half a dozen of them turned out to be wrongly encoded, and
typically returned responses like:
./<some-path>/plain-text-file.txt: text/plain; charset=iso-8859-1
./<some-other-path>/text-file.txt: text/plain; charset=unknown-8bit
Note how the encoding was either iso-8859-1 or unknown-8bit.
This makes sense: any non-ASCII Windows-1252 character is either a
valid ISO 8859-1 character, or it is one of the 27 characters in the
128 to 159 (0x80 to 0x9F) range for which no printable ISO 8859-1
characters are defined.
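To see this distinction in practice, you can feed file a couple of
artificial one-liners. This is just an illustrative sketch, assuming
a printf that understands \x escapes (bash's built-in printf does);
the exact labels in the output may vary with your version of file:
$ printf 'caf\xe9\n' | file -i -
/dev/stdin: text/plain; charset=iso-8859-1
$ printf '\x93quote\x94\n' | file -i -
/dev/stdin: text/plain; charset=unknown-8bit
The byte 0xE9 is é in both ISO 8859-1 and Windows-1252, whereas 0x93
and 0x94 (curly quotation marks) exist only in Windows-1252.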
1. a. A caveat with the find . -exec solution 2
A problem with the find . -exec solution is that it can be very slow,
and the problem grows with the size of the subdirectory tree under
scrutiny.
In my experience, it might be faster (potentially much faster) to run
a number of commands instead of the single command suggested above,
as follows:
$ file -i * | grep -Ev "binary|us-ascii|utf-8"
$ file -i */* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/*/* | grep -Ev "binary|us-ascii|utf-8"
$ …
Continue increasing the depth in these commands until the response is
something like this:
*/*/*/*/*/*/*: cannot open `*/*/*/*/*/*/*' (No such file or directory)
Once you see the cannot open / No such file or directory response, it
is clear that the entire subdirectory tree has been searched.
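If you would rather not type these commands one by one, the same
increasing-depth idea can be automated with a small bash loop. This
is a rough sketch, assuming bash (for the compgen builtin) and a tree
small enough that each expanded glob stays within the argument length
limit:
$ pattern='*'
$ while compgen -G "$pattern" > /dev/null; do
>     file -i $pattern | grep -Ev "binary|us-ascii|utf-8"
>     pattern="$pattern/*"
> done
Note that $pattern is deliberately left unquoted in the file command
so that the glob expands; the loop stops at the first depth where the
glob no longer matches anything.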
2. Convert the culprit files
Now that all suspicious files have been found, I prefer to use a text
editor to help with the conversion, rather than a command-line tool
like recode.
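If you do prefer to stay on the command line, iconv is a widely
available alternative. As a minimal sketch, assuming the file really
is Windows-1252 and reusing the example file name from section 1
(note that iconv writes to standard output, and that the accepted
encoding names vary slightly between implementations; iconv -l lists
them):
$ iconv -f WINDOWS-1252 -t UTF-8 plain-text-file.txt > plain-text-file.utf8.txt
$ mv plain-text-file.utf8.txt plain-text-file.txt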
2. a. On Windows, consider using Notepad++
On Windows, I like to use Notepad++ for converting files.
Have a look at this excellent post if you need help on that.
2. b. On Linux or macOS, consider using Visual Studio Code
On Linux and macOS, try VS Code for converting files.
I've given a few hints in this post.
References
1. Section 1 relies on using the file command, which unfortunately
isn't completely reliable.
As long as all your files are smaller than 64 kB, there shouldn't be
any problem.
For files (much) larger than 64 kB, there is a risk that non-ASCII
files will falsely be identified as pure ASCII files.
The fewer non-ASCII characters in such files, the bigger the risk
that they will be wrongly identified.
For more on this, see this post and its comments.
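If you suspect that a large file has been misidentified, one
workaround that reads the whole file is to pass it through iconv,
which exits with an error at the first invalid UTF-8 byte sequence
(the file name here is just a placeholder):
$ iconv -f UTF-8 -t UTF-8 some-large-file.txt > /dev/null && echo 'valid UTF-8'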
2. Subsection 1. a. is inspired by this answer.