48

I've copied certain files from a Windows machine to a Linux machine.
All the files encoded with Windows-1252 need to be converted to UTF-8.
The files which are already in UTF-8 should not be changed.

I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded.
Otherwise, I believe this would corrupt the file.

Henke
Sam

13 Answers

87

iconv -f WINDOWS-1252 -t UTF-8 filename.txt

Henke
Gregory Pakosz
  • 10
    It looks like iconv outputs to STDOUT, so you'll probably want to redirect it, e.g. `... > filename-utf8.txt` – mwfearnley May 17 '17 at 11:18
  • 5
    Beware that if the file is already UTF8, this will happily double-encode it, leaving you with an unreadable mess. – mivk Nov 10 '18 at 13:24
  • Related. * https://stackoverflow.com/q/12726517#comment17221039_12734567 * https://stackoverflow.com/a/12742879 * https://stackoverflow.com/a/24836200 * https://stackoverflow.com/a/9698582 – Henke Mar 18 '23 at 16:23
41

How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
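
For instance, a minimal shell sketch of that "validate first, then convert" idea might look like this (it uses iconv as the validator, since I'm not sure how recode handles a same-to-same conversion, so treat it as a heuristic rather than a guarantee):

# If the file survives a strict UTF-8 decode, assume it's already UTF-8;
# otherwise treat it as Windows-1252 and convert it in place with recode.
if iconv -f UTF-8 -t UTF-8 myfile.txt > /dev/null 2>&1; then
    echo "myfile.txt looks like valid UTF-8 - leaving it alone"
else
    recode windows-1252..utf8 myfile.txt
fi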

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.

Jon Skeet
  • 15
    There are a few bytes which cp1252 doesn't map to a character: 0x81, 0x8D, 0x8F, 0x90, 0x9D. The point stands, however. I wouldn't try to bulk-convert encodings of files from multiple different sources. – bobince Jan 06 '10 at 17:17
  • Thanks for pointing that out - I really thought *everything* was mapped in 1252. I'm sure it's the case for some other encodings :) – Jon Skeet Jan 06 '10 at 17:56
  • 4
    ISO-8859-1 maps every byte to a character, with the `80..9F` range being the C1 control characters. In Java I can decode every byte in the range `00..FF` to a String using ISO-8859-1, then re-encode it to get the original bytes back. When I try that with windows-1252 I get garbage for the values bobince listed. That surprised me; I thought it would fill those gaps with the corresponding control characters from ISO-8859-1. – Alan Moore Jan 07 '10 at 18:58
  • 3
    @AlanMoore: why would you expect it to fill in the gaps using characters from a different encoding? Windows-1252 and ISO-8859-1 are not the same thing, though many people (apparently also you) think they are. – Remy Lebeau Aug 16 '12 at 23:42
  • 5
    I know they're not the same, but cp1252 is usually described as being the same as Latin-1 but with most of those useless control characters replaced with useful, printing characters. If Microsoft really had started with Latin-1 and adapted it as that description implies, I would expect the remaining bytes to map to those same control characters. But it turns out the two encodings evolved pretty much side-by-side (sort of), and my assumption made an ass of me and Umption. :-/ – Alan Moore Aug 17 '12 at 10:20
  • I just came across this helpful discussion. I'm maintaining an editor for files that sometimes have unknown or even mixed encodings (text-based flat databases in which each line begins with an ASCII field marker). It is paramount that it not destroy data when saving, so it defaults to cp1252 when it isn't 100% sure that a field's contents are in utf-8. I picked that because it handled curly quotes better, but your comments suggest that ISO-8859-1 would be safer. Correct? (GUI, C#, WinForms) – Jon Coombs Jul 23 '14 at 05:28
  • 2
    @JCoombs: It would be better just not to treat it as text at all, if you don't know the encoding. – Jon Skeet Jul 23 '14 at 06:03
  • That makes a lot of sense. The way this editor is set up, though, I don't think I can change that. And in nearly every case we do know the encoding. The questionable cases are when a user has first opened a file and may not have specified the encodings properly. We want the default encoding to be very forgiving in terms of preserving the bytes. So far so good, but if there are edge cases out there, I want to find them. – Jon Coombs Jul 23 '14 at 07:24
  • @AlanMoore: Were you speaking contrastively in your first comment about ISO-8859-1? That's how I first read it, but maybe you weren't really saying that cp1252 does *not* map every byte to a character. According to wikipedia, cp1252 "is a superset of ISO 8859-1, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range." http://en.wikipedia.org/wiki/Windows-1252 – Jon Coombs Jul 23 '14 at 07:29
  • 3
    @JCoombs: Cp1252 is a superset of ISO 8859-1, but not a superset of ISO-8859-1. Yes, believe it or not, that extra dash makes a difference. ISO-8859-1 fills in bytes 0x80 to 0x9f with U+0080 to U+009F, all of which are control characters IIRC. – Jon Skeet Jul 23 '14 at 07:32
  • @JonSkeet: Wow. Thanks for clarifying that! So it sounds like ISO-8859-1 and cp1252 should be equally good (or bad) at preserving any given piece of text, in which case I'd choose cp1252 because it handles curly quotes better. Or maybe cp1252 is slightly better because I suppose control characters might skew/disappear if the user actually edits a record. – Jon Coombs Jul 23 '14 at 15:30
  • 1
    @JCoombs: Well, sort of. I think you'd need to give me very specific use cases... but fundamentally, if you're dealing with data that you don't know the encoding for, you're really in a nasty situation. – Jon Skeet Jul 23 '14 at 15:37
  • ERROR: character with byte sequence 0xe0 0xb8 0x84 in encoding "UTF8" has no equivalent in encoding "WIN1252" – Thomas Stubbe Aug 25 '20 at 08:50
  • @ThomasStubbe: That suggests you're going in the opposite direction to the question. This question is about going *from* Windows-1252 *to* UTF-8. – Jon Skeet Aug 25 '20 at 08:59
  • I want to go from Windows-1252 to UTF-8. It's an embedded DB on FS (windows), which has troubles parsing UTF-8 Thai-signs (the error above). The comment was just to point out, like some others already did, that UTF-8 does not map every byte to a character – Thomas Stubbe Aug 25 '20 at 09:40
  • @ThomasStubbe: But if you're trying to go *from* Windows-1252, you're mapping *from* characters to bytes with UTF-8, so you shouldn't see that error. I suggest you ask a new question with more details - at the moment I don't think these new comments are adding value to the post. – Jon Skeet Aug 25 '20 at 09:45
9

Here's a transcription of another answer I gave to a similar question:

If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

Update:

I've included another function, Encoding::fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Sebastián Grignoli
  • Hi Sebastián. If I have an SQL export, how do I parse the file through your function? Is there a stand-alone script you have written that can be invoked at the command line on the form `fixutf8 input.sql >output.sql` or would you be able to assist me in converting your php to a cli script? – Ali Samii Mar 18 '14 at 12:09
  • The easiest and shortest possible way is this: `` – Sebastián Grignoli Mar 18 '14 at 12:20
  • This is amazing. I love it! – rockstardev Nov 29 '18 at 07:32
8

There's no general way to tell if a file is encoded with a specific encoding. Remember that an encoding is nothing more than an "agreement" about how the bits in a file should be mapped to characters.

If you don't know which of your files are actually already encoded in UTF-8 and which ones are encoded in windows-1252, you will have to inspect all files and find out yourself. In the worst case that could mean that you have to open every single one of them with either of the two encodings and see whether they "look" correct -- i.e., all characters are displayed correctly. Of course, you may use tool support to do that. For instance, if you know for sure that the files contain certain characters that have a different mapping in windows-1252 vs. UTF-8, you could grep for them after running the files through iconv, as mentioned by Seva Alekseyev.

Another lucky case would be if you know that the files actually contain only characters that are encoded identically in both UTF-8 and windows-1252. In that case, of course, you're done already.
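
As a small aid for that last case: files whose bytes are all in the ASCII range are identical under both encodings, so one rough way to narrow things down is to list only the files that contain at least one non-ASCII byte (a sketch assuming GNU grep, whose -P option together with the C locale makes the byte-range match work):

# Files not listed here are pure ASCII and need no conversion at all.
LC_ALL=C grep -rlP '[\x80-\xFF]' .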

kleiba
8

If you want to convert multiple files in a single command ‒ let's say all *.txt files ‒ here is the command:

find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;
Anthony O.
  • 2
    this converts **all files to UTF-8** without considering their encoding and will mess up files already in UTF-8 and is **not** what the OP wants – phuclv Apr 23 '22 at 03:21
2

Use the iconv command.

To make sure the file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests current encoding as the default; if it's Windows-1252 (or any 1-byte codepage, for that matter), it would say "ANSI".

Seva Alekseyev
  • Opening each file would be an exhaustive process. I want to do the conversion for a large number of files. Is there any other way I could do this? – Sam Jan 06 '10 at 15:56
  • What language are the files in? The difference between Windows-1252 and UTF-8 only manifests on non-ASCII characters, i. e. on national ones. Any file is a valid Windows-1252 file, but without looking at the content and checking if the characters make sense in the target language you cannot tell if it's really Windows-1252. If the file has no extended characters, then the conversion would be trivial anyway, and you don't have to bother. – Seva Alekseyev Jan 06 '10 at 16:16
  • 1
    Addition: you can validate UTF-8 though. Even iconv can do that - convert a file from UTF-8 to UTF-16 and back; if it's not identical to the original, then UTF-8 it was not. Probably easy to do with creative pipelining. – Seva Alekseyev Jan 06 '10 at 16:26
  • And before you start, do some stats. How many files from the bulk actually do require conversion? – Seva Alekseyev Jan 06 '10 at 16:29
1

You can change the encoding of a file with an editor such as Notepad++. Just go to Encoding and select what you want.

I always prefer Windows-1252.

thanos.a
  • 1
    Notepad++ is a Windows-only tool but the question is about Linux. – parsley72 Sep 25 '15 at 22:41
  • 2
    @parsley "I've copied certain files from a Windows machine" means that has access to windows machine as well. He can do this convert with a single menu option to all files or to a copy of all files before getting them to him Linux machine. You can revert the down vote. Thanks – thanos.a Oct 13 '15 at 08:57
  • Windows-1252 or ISO-8859-1 is always a bad idea in a Unicode world. Sharing files between systems became a problem because many applications assume files are always in UTF-8. Besides, this isn't suitable for mass-converting a large number of files – phuclv Apr 23 '22 at 03:19
0

If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.

While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid UTF-8. So:

recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt

This will spit out errors for all cp1252 files, and then proceed to convert them to UTF-8.

I would wrap this into a cleaner bash script, keeping a backup of every converted file.

Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.
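
For illustration only, such a wrapper might look roughly like this (the *.txt glob and the .bak suffix are my own choices, not part of the recipe above):

#!/bin/bash
# Convert every *.txt that fails the UTF-8 check, keeping a .bak backup.
for f in *.txt; do
    if recode utf8..utf16 < "$f" > /dev/null 2>&1; then
        echo "$f: already valid UTF-8, skipping"
    else
        cp -p "$f" "$f.bak"        # keep a backup before touching the file
        recode cp1252..utf8 "$f" && echo "$f: converted from cp1252 to UTF-8"
    fi
done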

mivk
  • 2
    Only byte values 00-7F are the exact same in Windows-1252 and UTF-8. Byte values 80-FF have different meanings in Windows-1252 and UTF-8. So saying "utf8 is valid Win-1252" is only true for bytes 00-7F. – Remy Lebeau Aug 16 '12 at 23:41
  • They obviously have different "meaning", but all bytes in a UTF8 file can be "valid" (even if non-sensical) CP1252 characters. Anyway, the above works well for me in practice. – mivk May 11 '13 at 10:58
  • Actually, there are 5 byte values that are officially undefined in CP1252 but which have meaning in UTF-8: 0x81, 0x8D, 0x8F, 0x90, and 0x9D. However, Microsoft APIs map them to C1 control codes during text conversions. – Remy Lebeau May 11 '13 at 18:14
  • Guys, UTF8 does not always map to a single byte or even 2 bytes, take € for example. https://www.i18nqa.com/debug/utf8-debug.html – Jay Nov 10 '18 at 13:02
  • @Jay : Yes, of course not. But the question is about converting from cp1252 to UTF8, not the other way around. – mivk Nov 10 '18 at 13:19
  • There are more than 5 such bytes though, and in general more than 5 single-byte encodings, making it hard to say what the source encoding actually was... – Jay Nov 10 '18 at 13:21
0

This script worked for me on Windows 10 / PowerShell 5.1 for converting CP1250 to UTF-8:

Get-ChildItem -Include *.php -Recurse | ForEach-Object {
    $file = $_.FullName

    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    try
    {
        # Discard the result; we only care whether decoding throws.
        $null = [IO.File]::ReadAllText($file, [Text.UTF8Encoding]::new($false, $true))
    }
    catch [System.Text.DecoderFallbackException]
    {
        # Fall back to Windows-1250
        $content = [IO.File]::ReadAllText($file,[Text.Encoding]::GetEncoding(1250))
        $mustReWrite = $true
    }

    # Rewrite as UTF-8 without BOM (the .NET Framework's default for WriteAllText)
    if ($mustReWrite)
    {
        Write "Converting from 1250 to UTF-8"
        [IO.File]::WriteAllText($file, $content)
    }
    else
    {
        Write "Already UTF-8-encoded"
    }
}
0

As said, you can't reliably determine whether a file is Windows-1252 because Windows-1252 maps almost all bytes to a valid code point. However, if the files are only in Windows-1252 and UTF-8 (and no other encodings), then you can try to parse a file as UTF-8; if it contains invalid byte sequences, it's a Windows-1252 file:

if iconv -f UTF-8 -t UTF-16 "$FILE" 1>/dev/null 2>&1; then
    # Conversion succeeded
    echo "$FILE is in UTF-8"
else
    # iconv returns an error if there are invalid characters in the byte stream
    echo "$FILE is in Windows-1252. Converting to UTF-8"
    iconv -f WINDOWS-1252 -t UTF-8 -o "${FILE}_utf8.txt" "$FILE"
fi

This is similar to many other answers that try to treat the file as UTF-8 and check if there are errors. It works 99% of the time because most Windows-1252 texts will be invalid in UTF-8, but there will still be rare cases when it won't work. It's heuristic after all!
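
To run that check over a whole directory tree rather than a single file, you could wrap it in a loop, for example (a sketch; the output naming follows the snippet above, and file names containing newlines are not handled):

# Apply the same heuristic to every regular file below the current directory.
find . -type f | while read -r FILE; do
    if iconv -f UTF-8 -t UTF-16 "$FILE" 1>/dev/null 2>&1; then
        echo "$FILE is in UTF-8"
    else
        echo "$FILE is in Windows-1252. Converting to UTF-8"
        iconv -f WINDOWS-1252 -t UTF-8 -o "${FILE}_utf8.txt" "$FILE"
    fi
done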

There are also various libraries and tools to detect the character set, such as chardet:

$ chardet utf8.txt windows1252.txt iso-8859-1.txt
utf8.txt: utf-8 with confidence 0.99
windows1252.txt: Windows-1252 with confidence 0.73
iso-8859-1.txt: ISO-8859-1 with confidence 0.73

It can't be completely reliable due to its heuristic nature, so it outputs a confidence value for people to judge. The more human text in the file, the more confident it'll be. If you have very specific texts, then more training of the library will be needed. For more information, read How do browsers determine the encoding used?

phuclv
0

1. The files which are already in UTF-8 should not be changed 1

When I recently had this issue, I solved it by first finding all files in need of conversion.
I did this by excluding the files that should not be converted. This includes binary files, pure ASCII files (which by definition already have a valid UTF-8 encoding), and files that contain at least some valid non-ASCII UTF-8 characters.

In short, I recursively searched the files that probably should be converted :

$ find . -type f -name '*' -exec sh -c 'for n; do file -i "$n" | grep -Ev "binary|us-ascii|utf-8"; done' sh {} +

I had a subdirectory tree containing some 300 – 400 files. About half a dozen of them turned out to be wrongly encoded, and typically returned responses like :

./<some-path>/plain-text-file.txt: text/plain; charset=iso-8859-1
./<some-other-path>/text-file.txt: text/plain; charset=unknown-8bit

Note how the encoding was either iso-8859-1, or unknown-8bit.
This makes sense – any non-ASCII Windows-1252 character can either be a valid ISO 8859-1 character – or – it can be one of the 27 characters in the 128 – 159 (x80 – x9F) range for which no printable ISO 8859-1 characters are defined.

1. a. A caveat with the find . -exec solution 2

A problem with the find . -exec solution is that it can be very slow – a problem that grows with the size of the subdirectory tree under scrutiny.

In my experience, it might be faster – potentially much faster – to run a number of commands instead of the single command suggested above, as follows :

$ file -i * | grep -Ev "binary|us-ascii|utf-8"
$ file -i */* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/* | grep -Ev "binary|us-ascii|utf-8"
$ file -i */*/*/* | grep -Ev "binary|us-ascii|utf-8"
$ …

Continue increasing the depth in these commands until the response is something like this:

*/*/*/*/*/*/*: cannot open `*/*/*/*/*/*/*' (No such file or directory)

Once you see cannot open / (No such file or directory), it is clear that the entire subdirectory tree has been searched.
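
If you'd rather not type those commands by hand, the depth increase can be automated with a small loop, for instance (a bash sketch; compgen -G is bash-specific and merely tests whether the pattern still matches anything):

# Keep deepening the glob until it no longer matches any path.
pattern='*'
while compgen -G "$pattern" > /dev/null; do
    file -i $pattern 2>/dev/null | grep -Ev "binary|us-ascii|utf-8"   # pattern left unquoted on purpose, so it expands
    pattern="$pattern/*"
done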

2. Convert the culprit files

Now that all suspicious files have been found, I prefer to use a text editor to help with the conversion, instead of using a command line tool like recode.

2. a. On Windows, consider using Notepad++

On Windows, I like to use Notepad++ for converting files.
Have a look at this excellent post if you need help on that.

2. b. On Linux or macOS, consider using Visual Studio Code

On Linux and macOS, try VS Code for converting files. I've given a few hints in this post.

References


1 Section 1 relies on using the file command, which unfortunately isn't completely reliable. As long as all your files are smaller than 64 kB, there shouldn't be any problem. For files (much) larger than 64 kB, there is a risk that non-ASCII files will falsely be identified as pure ASCII files. The fewer non-ASCII characters in such files, the bigger the risk that they will be wrongly identified. For more on this, see this post and its comments.

2 Subsection 1. a. is inspired by this answer.

Henke
-1

Found this documentation for the TYPE command:

Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:

For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G    
CHCP 1252 >NUL    
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL    
CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt    
CHCP %_codepage%    

The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.

Owen Pauling
-1

UTF-8 does not have a BOM, as it is both superfluous and invalid. Where a BOM is helpful is in UTF-16, which may be byte swapped, as in the case of Microsoft. UTF-16 is for internal representation in a memory buffer. Use UTF-8 for interchange. By default, UTF-8, anything else derived from US-ASCII, and UTF-16 are in natural/network byte order. The Microsoft UTF-16 requires a BOM as it is byte swapped.

To convert Windows-1252 to ISO8859-15, I first convert ISO8859-1 to US-ASCII for codes with similar glyphs. I then convert Windows-1252 up to ISO8859-15, mapping other non-ISO8859-15 glyphs to multiple US-ASCII characters.

  • in Windows a BOM in UTF-8 is **not** a BOM but a type of signature, because [There Ain't No Such Thing as Plain Text](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/), you must have a way to differentiate various types of text files. If a file has a UTF-8 BOM signature then Windows considers it a UTF-8 file, otherwise it's in ANSI encoding. In Linux only UTF-8 is used for all text files so there's no need for distinction – phuclv Apr 23 '22 at 03:13