What is the fastest, easiest tool or method to convert text files between character sets?

Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.

Everything goes: one-liners in your favorite scripting language, command-line tools, other OS utilities, web sites, etc.

Best solutions so far:

On Linux/UNIX/OS X/cygwin:

  • GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:

      $ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
    

    As pointed out by Ben, there is an online converter using iconv.

  • recode (manual), suggested by Cheekysoft, will convert one or several files in place. Example:

      $ recode UTF8..ISO-8859-15 in.txt
    

    This one uses shorter aliases:

      $ recode utf8..l9 in.txt
    

    Recode also supports surfaces, which can be used to convert between different line-ending types and encodings:

    Convert newlines from LF (Unix) to CR-LF (DOS):

      $ recode ../CR-LF in.txt
    

    Base64 encode file:

      $ recode ../Base64 in.txt
    

    You can also combine them.

    Convert a Base64-encoded UTF-8 file with Unix line endings to a Base64-encoded Latin-1 file with DOS line endings:

      $ recode utf8/Base64..l1/CR-LF/Base64 file.txt
    

On Windows with PowerShell (Jay Bazuzi):

  • PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt

(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)

Edit

Do you mean ISO-8859-1 support? Using "String" does this, e.g. for the reverse conversion:

gc -en string in.txt | Out-File -en utf8 out.txt

Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".

Antti Kissaniemi

  • I tried `gc -en Ascii readme.html | Out-File -en UTF8 readme.html`, but it converts the file to UTF-8 and then it's empty! Notepad++ says the file is ANSI format, but from what I've read that isn't even a valid charset? http://uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF – OZZIE Sep 13 '13 at 12:24
  • Just came across this looking for an answer to a related question – great summary! Just thought it was worth adding that `recode` will act as a filter as well if you don't pass it any filenames, e.g.: `recode utf8..l9 < in.txt > out.txt` – Jez Mar 06 '14 at 11:05
  • http://www.iconv.com/iconv.htm seems to be dead for me (timeout). – Andrew Newby May 12 '14 at 06:51
  • If you use `enca`, you do not need to specify the input encoding. It is often enough just to specify the language: `enca -L ru -x utf8 FILE.TXT`. – Alexander Pozdneev Jul 31 '15 at 19:04
  • Actually, iconv worked much better as an in-place converter than as a filter. Converting a file with more than 2 million lines using `iconv -f UTF-32 -t UTF-8 input.csv > output.csv` saved only about seven hundred thousand lines, only a third. Using the in-place version `iconv -f UTF-32 -t UTF-8 file.csv` successfully converted all 2 million-plus lines. – Nicolay77 May 19 '16 at 23:04
  • The encoding "ISO-8859-1" doesn't work for me; it's "ISO8859-1". If you want to see all the encodings available for conversion, just type `iconv -l` in a console. Thanks for the help. – Cocuba Jun 15 '16 at 15:23
  • `find httpdocs -type f -exec recode ISO-8859-15..UTF8 {} \;` and pray you don't have issues with images. – sjas Jan 18 '17 at 10:48
  • Thanks a lot for the summary. Much better than the answers, IMHO. – xpt Jul 12 '18 at 13:40
  • @Cocuba iconv recognizes 8859_1, ISO-8859-1, ISO8859-1, ISO88591 and ISO_8859-1 (the same goes for the other ISO 8859 character encodings). Checked with iconv 2.27 (Ubuntu). – rmuller Jun 21 '19 at 11:09
  • How do you convert to `LF`? There is `/CR` and `/CR-LF` but no `/LF`. – Aaron Franke Mar 18 '20 at 10:24
  • iconv was barfing on my input ("illegal input sequence"), so I fed it one line at a time via a bash while loop. Worked great, as I was willing to discard the bad lines. – nortally Jun 22 '22 at 15:53

21 Answers

Stand-alone utility approach

iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING  the encoding of the input
-t ENCODING  the encoding of the output

You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
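For the UTF-8/ISO-8859-15 pair the question asks about, a quick round trip is an easy sanity check (a sketch; the filenames are arbitrary, and it assumes a UTF-8 locale and GNU iconv):

```shell
# Round-trip UTF-8 -> ISO-8859-15 -> UTF-8. Both € and œ exist in
# ISO-8859-15, so the round trip should be lossless.
printf 'Prix: 10€, œuvre\n' > in.txt
iconv -f UTF-8 -t ISO-8859-15 in.txt > latin9.txt
iconv -f ISO-8859-15 -t UTF-8 latin9.txt > back.txt
cmp -s in.txt back.txt && echo "round trip OK"
```

Note that characters outside ISO-8859-15 (an en dash, for instance) would make the first iconv call fail with "illegal input sequence"; see the `-c` and `//TRANSLIT` tips in the comments below for ways around that.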

Troels Arvin

  • For anyone else who's getting tripped up by the non-dash versions being unavailable, it looks like OS X (and possibly all BSD) versions of iconv don't support the non-dash aliases for the various UTF-* encodings. `iconv -l | grep UTF` will tell you all the UTF-related encodings that your copy of iconv does support. – coredumperror May 02 '12 at 19:10
  • Don't know the encoding of your input file? Use `chardet in.txt` to generate a best guess. The result can be used as ENCODING in `iconv -f ENCODING`. – Stew Sep 16 '14 at 16:45
  • Prevent exit at invalid characters (avoiding `illegal input sequence at position` messages), and replace "weird" characters with "similar" characters: `iconv -c -f UTF-8 -t ISO-8859-1//TRANSLIT in.txt > out.txt`. – knb Feb 06 '15 at 11:07
  • I like this because it's standard on most *NIX platforms. But also see the VIM command option (alias: `ex`) [below](http://stackoverflow.com/a/32861628/2114313). Additional info: (1) you (probably) don't need to specify the `-f` (from) option with `iconv`. (2) the `file --mime-encoding` command can help you figure out the encoding in the first place. – frIT Jan 15 '16 at 11:37
  • FWIW the `file` command reported my source as UTF-16 little-endian; running `iconv -f UTF-16 -t UTF-8 ...` transformed it incorrectly to ASCII. I had to explicitly specify `iconv -f UTF-16LE ...` to output UTF-8. – Plato Dec 14 '16 at 23:04
  • For anyone wondering: for a file with about 6M rows (2.3 GB) it took 7 min to convert. – tbotalla Oct 07 '21 at 04:38
  • Very fast, even with a 4 GB file. – Fabien Haddadi Jan 06 '22 at 17:59
  • I get confused by the result: I have a plain English text file (`[-:/a-zA-Z0-9]`) and am trying to convert the encoding from ASCII to UTF-8; neither `iconv` nor `:set fileencoding=utf-8` in vim works. The result of `file` still gives `ascii`. – jimmymcheung Jun 03 '23 at 17:16
  • @jimmymcheung: Plain 7-bit ASCII is by definition also valid UTF-8. So it's expected. – Troels Arvin Jun 04 '23 at 15:44
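The detection tips in the comments above can be combined into a short sketch that asks `file` for the encoding and feeds the answer straight to iconv (assumes GNU `file` and `iconv`; the filenames are examples):

```shell
# Make a Latin-1 file, detect its encoding, and convert it to UTF-8.
printf 'na\357ve caf\351, r\351sum\351\n' > in.txt   # "naïve café, résumé" in ISO-8859-1 bytes
enc=$(file -b --mime-encoding in.txt)                # typically "iso-8859-1"
iconv -f "$enc" -t UTF-8 in.txt > out.txt
cat out.txt
```

As the UTF-16 comment above shows, detection is heuristic: for short or ambiguous files, double-check the reported encoding before trusting the conversion.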

Try VIM

If you have vim you can use this:

vim +"set nobomb | set fenc=utf8 | x" filename.txt

It has not been tested for every encoding, but the cool part is that you don't have to know the source encoding. Be aware that this command modifies the file directly, in place.


Explanation part!

  1. + : Used by vim to directly enter a command when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
  2. | : Separator of multiple commands (like ; in bash)
  3. set nobomb : no utf-8 BOM
  4. set fenc=utf8 : Set new encoding to utf-8 doc link
  5. x : Save and close file
  6. filename.txt : path to the file
  7. " : qotes are here because of pipes. (otherwise bash will use them as bash pipe)
Boop

  • Quite cool, but somewhat slow. Is there a way to change this to convert a number of files at once (thus saving on vim's initialization costs)? – DomQ Apr 25 '16 at 08:20
  • Thank you for the explanation! I was having a difficult time with the beginning of the file until I read up about the bomb/nobomb setting. – jjwdesign Oct 03 '16 at 13:34
  • np, additionally you can view the BOM if you use `vim -b` or `head file.txt | cat -e` – Boop Oct 03 '16 at 13:38
  • for example: `find -regextype posix-extended -type f -regex ".*\.(h|cpp|rc|fx|cs|props|xaml)" -exec vim +'set nobomb | set fenc=utf8 | x' {} \;` – Gabriel Apr 06 '17 at 08:48
  • I used this to convert the encoding of CSV files and was really excited when I saw the charset had indeed changed. Unfortunately, when I went to load the file into MySQL, it had a different number of columns than it previously had before running the vim command. Wonder if it would be possible to just open the file, convert the encoding, and save/close the file while leaving all other file content the same? – NightOwlPrgmr Apr 28 '17 at 15:00
  • Many ways: 1. use @Gabriel's command; 2. shell expansion, `vim +'set nobomb | set fenc=utf8 | x' *.yaml` (e.g.); 3. a loop, `for f in a.txt b.txt; do vim +'set nobomb | set fenc=utf8 | x' "${f}"; done` (none of these has been tested) – Boop Sep 20 '19 at 08:36

Under Linux you can use the very powerful recode command to convert between different charsets and fix line-ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.

Cheekysoft

iconv(1)

iconv -f FROM-ENCODING -t TO-ENCODING file.txt

There are also iconv-based tools in many languages.

Daniel Papasian

Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT

The shortest version, if you can assume that the input BOM is correct:

gc FILE.TXT | Out-File -en utf7 file-utf7.txt
Jay Bazuzi

  • Here's a shorter version that works better: `gc .\file-utf8.txt | sc -en utf7 .\file-utf7.txt` – Larry Battle Jul 15 '12 at 06:16
  • @LarryBattle: How does `Set-Content` work better than `Out-File`? – Jay Bazuzi Jul 15 '12 at 19:30
  • ...oh. I guess they're nearly the same thing. I had trouble running your example because I was assuming that both versions were using the same `file-utf8.txt` file for input, since they both had the same output file as `file-utf7.txt`. – Larry Battle Jul 15 '12 at 21:24
  • This would be really great, except that it doesn't support UTF-16. It supports UTF-32, but not UTF-16! I wouldn't need to convert files, except that a lot of Microsoft software (e.g. SQL Server bcp) insists on UTF-16, and then their utility won't convert to it. Interesting, to say the least. – Noah Aug 22 '13 at 01:45
  • I tried `gc -en Ascii readme.html | Out-File -en UTF8 readme.html`, but it converts the file to UTF-8 and then it's empty! Notepad++ says the file is ANSI format, but from what I've read that isn't even a valid charset? http://uk.answers.yahoo.com/question/index?qid=20100927014115AAiRExF – OZZIE Sep 13 '13 at 12:23
  • @OZZIE I don't think you can edit a file in place like that. Try saving the content to a temporary file first. – rob Nov 19 '13 at 21:01

Try Notepad++

On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".

Jeremy Glover

Try iconv Bash function

I've put this into .bashrc:

utf8()
{
    iconv -f ISO-8859-1 -t UTF-8 $1 > $1.tmp
    rm $1
    mv $1.tmp $1
}

...to be able to convert files like so:

utf8 MyClass.java
Arne Evertsson

  • It's better style to use `tmp=$(mktemp)` to create a temporary file. Also, the line with rm is redundant. – LMZ Feb 26 '15 at 22:20
  • Can you complete this function with auto-detection of the input format? – mlibre Apr 20 '16 at 20:28
  • Beware: this function deletes the input file without verifying that the iconv call succeeded. – philwalk Dec 05 '17 at 19:48
  • This changes the contents of the text file. I ran this on a UTF-8 with BOM file expecting to get out a UTF-8 without BOM file, but it prepended `` at the start of the file. – Aaron Franke Mar 19 '20 at 20:53
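Taking the comments above into account (use mktemp, and don't touch the original unless iconv succeeded), a safer variant might look like this (a sketch; the hard-coded ISO-8859-1 source encoding is the same assumption the original function makes):

```shell
utf8() {
    # Convert "$1" from ISO-8859-1 to UTF-8 in place, replacing the
    # original file only if the conversion succeeded.
    local tmp
    tmp=$(mktemp) || return 1
    if iconv -f ISO-8859-1 -t UTF-8 "$1" > "$tmp"; then
        mv "$tmp" "$1"
    else
        rm -f "$tmp"
        printf '%s: conversion failed, file left untouched\n' "$1" >&2
        return 1
    fi
}
```

Usage is unchanged: `utf8 MyClass.java`.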

One-liner using find, with automatic character set detection

The character encoding of all matching text files gets detected automatically and all matching text files are converted to utf-8 encoding:

$ find . -type f -iname "*.txt" -exec sh -c 'iconv -f $(file -bi "$1" | sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;

To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.

Whereby file -bi means:

  • -b, --brief Do not prepend filenames to output lines (brief mode).

  • -i, --mime Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say for example text/plain; charset=us-ascii rather than ASCII text. The sed command cuts this to only us-ascii as is required by iconv.

The find command is very useful for this kind of file management automation.
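file also has a --mime-encoding switch that prints just the charset, which removes the need for the sed post-processing. A variant sketch (the demo directory and file names are invented; assumes GNU file and iconv):

```shell
# Convert every .txt under demo/ to UTF-8, auto-detecting each file's encoding.
mkdir -p demo
printf 'na\357ve caf\351\n' > demo/a.txt         # ISO-8859-1 bytes
printf 'already UTF-8: é\n' > demo/b.txt
find demo -type f -iname "*.txt" -exec sh -c '
    enc=$(file -b --mime-encoding "$1")
    iconv -f "$enc" -t UTF-8 -o "$1.tmp" "$1" && mv "$1.tmp" "$1"
' -- {} \;
```

Files already in UTF-8 (or plain ASCII) pass through unchanged, since converting from utf-8 or us-ascii to UTF-8 is a no-op.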

Serge Stroobandt

  • I had to adapt this solution a bit to make it work on Mac OS X, at least on my version: `find . -type f -iname *.txt -exec sh -c 'iconv -f $(file -b --mime-encoding "$1" | awk "{print toupper(\$0)}") -t UTF-8 > converted "$1" && mv converted "$1"' -- {} \;` – Brian J. Miller Jan 20 '17 at 20:07
  • Your code worked on Windows 7 with MinGW-w64 (latest version) too. Thanks for sharing it! – silvioprog Jan 06 '18 at 19:05
  • @rmuller The `sed` command is there on purpose, enabling the automatic detection of character encoding. I have expanded the answer to explain this now. It would be courteous with regard to the readership to delete any remaining irrelevant comments. Thank you. – Serge Stroobandt Jun 22 '19 at 18:16
  • @SergeStroobandt Maybe I was not clear enough. My point is that when you use `file -b --mime-encoding` instead of `file -bi`, there is no need to filter the result with sed. That command already returns the file encoding only, so in your example "us-ascii". – rmuller Jun 23 '19 at 15:31
  • This doesn't actually seem to do anything for me on Linux. I saved a file as UTF-8 with BOM and expected it to convert to UTF-8 without BOM, and it didn't. – Aaron Franke Mar 19 '20 at 20:50
  • That's what I get when I run it: `Usage: iconv [-c] [-s] [-f fromcode] [-t tocode] [file ...] or: iconv -l. Try 'iconv --help' for more information.` – paradox Aug 26 '21 at 13:28

Assuming you don't know the input encoding and still wish to automate most of the conversion, I distilled this one-liner from the previous answers:

iconv -f $(chardetect input.text | awk '{print $2}') -t utf-8 -o output.text input.text
Marcelo Ruggeri

DOS/Windows: use the code page

chcp 65001>NUL
type ascii.txt > unicode.txt

The chcp command changes the console code page. Code page 65001 is Microsoft's name for UTF-8. Once it is set, the output generated by subsequent commands will use that code page.

lalthomas

Try EncodingChecker

EncodingChecker on github

File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.

File Encoding Checker requires .NET 4 or above to run.

For encoding detection, File Encoding Checker uses the UtfUnknown Charset Detector library. UTF-16 text files without byte-order-mark (BOM) can be detected by heuristics.

Amr Ali

PHP iconv()

iconv("UTF-8", "ISO-8859-15", $input);

user15096

To write properties files (Java), I normally use this on Linux (Mint and Ubuntu distributions):

$ native2ascii filename.properties

For example:

$ cat test.properties 
first=Execução número um
second=Execução número dois

$ native2ascii test.properties 
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois

PS: I wrote "Execution number one/two" in Portuguese to force special characters.

In my case, on the first execution I received this message:

$ native2ascii teste.txt 
The program 'native2ascii' can be found in the following packages:
 * gcj-5-jdk
 * openjdk-8-jdk-headless
 * gcj-4.8-jdk
 * gcj-4.9-jdk
Try: sudo apt install <selected package>

When I installed the first option (gcj-5-jdk), the problem was solved.

I hope this helps someone.
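If installing a JDK just for native2ascii feels heavy, the same escaping can be approximated with a Python one-liner wrapped in shell (a sketch; it escapes every non-ASCII character as \uXXXX, matching the output shown above, and the filenames are just examples):

```shell
# Approximate native2ascii: replace every non-ASCII character with \uXXXX.
printf 'first=Execução número um\n' > test.properties
python3 -c '
import sys
with open("test.properties", encoding="utf-8") as f:
    for line in f:
        sys.stdout.write("".join(
            c if ord(c) < 128 else "\\u%04x" % ord(c) for c in line))
' > test-ascii.properties
cat test-ascii.properties
```

Unlike native2ascii, this sketch handles only the escaping direction (no -reverse), and characters outside the Basic Multilingual Plane would need surrogate-pair handling.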


With ruby:

ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"

Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences

Dorian

Simply change the encoding of the loaded file in the IntelliJ IDEA IDE, on the right of the status bar (bottom), where the current charset is indicated. It prompts you to Reload or Convert; use Convert. Make sure you back up the original file in advance.


In PowerShell:

function Recode($InCharset, $InFile, $OutCharset, $OutFile)  {
    # Read input file in the source encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
    $Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
    
    # Write output file in the destination encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)    
    [System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
}

Recode Windows-1252 "$pwd\in.txt" utf8 "$pwd\out.txt" 

For a list of supported encoding names:

https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding

Amr Ali

There is also a web tool to convert file encoding: https://webtool.cloud/change-file-encoding

It supports a wide range of encodings, including some rare ones, like IBM code page 37.

Pavel Morshenyuk

Use this Python script: https://github.com/goerz/convert_encoding.py Works on any platform. Requires Python 2.7.

kinORnirvana

My favorite tool for this is jEdit (a Java-based text editor), which has two very convenient features:

  • One that enables the user to reload a text file with a different encoding (and, as such, to check the result visually)
  • Another that enables the user to explicitly choose the encoding (and end-of-line character) before saving
yota

If macOS GUI applications are your bread and butter, SubEthaEdit is the text editor I usually go to for encoding wrangling: its "conversion preview" allows you to see all invalid characters in the output encoding, and fix or remove them.

And it's open source now, so yay for them.

tiennou

Visual Studio Code

  1. Open your file in Visual Studio Code
  2. Reopen with Encoding: In the bottom status bar, to the right, you should see your current file encoding (eg "UTF-8"). Click this and select "Reopen with Encoding".
  3. Select the correct encoding of the file (eg: ISO 8859-2).
  4. Confirm that your content is displaying as expected.
  5. Save with Encoding: The bottom status bar should now display your new encoding format (eg: ISO 8859-2). Click this and choose "Save with Encoding" and select UTF-8 (or whatever new encoding you want).

NOTE: THIS WILL OVERWRITE YOUR ORIGINAL FILE. MAKE A BACKUP FIRST.

Alex Czarto