69

I have a Java app which reads CSV files that have been created in Excel (e.g. 2007). Does anyone know what charset MS Excel uses to save these files?

I would have guessed one of:

  • windows-1255 (Cp1255)
  • ISO-8859-1
  • UTF8

but I am unable to decode extended characters (e.g. French accented letters) using any of these charsets.

Rick
  • 7,007
  • 2
  • 49
  • 79
Joel
  • 29,538
  • 35
  • 110
  • 138

11 Answers

45

From memory, Excel uses the machine-specific ANSI encoding. So this would be Windows-1252 for an EN-US installation, 1251 for Russian, etc.
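
If that is the case, reading the file from Java with the matching ANSI code page should decode the accented characters correctly. A minimal sketch, assuming windows-1252 and a placeholder file name export.csv:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadExcelCsv {
    public static void main(String[] args) throws IOException {
        // Excel's plain CSV export typically uses the machine's ANSI code page,
        // e.g. windows-1252 on a US / Western European installation
        Charset ansi = Charset.forName("windows-1252");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("export.csv"), ansi)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}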

devstuff
  • 8,277
  • 1
  • 27
  • 33
  • True but Excel 2007 allows the user (if they can find the Tools button!) to choose from a long list; "ANSI" is the default. – John Machin May 18 '10 at 20:44
  • 10
    That's how it should work, but the feature is broken. Instead of honoring the "save this document as:" option you select in the save-as dialog/tools menu/web option item/encoding tab, Excel just uses the code-page set in your Region and Language settings. – Triynko Feb 07 '12 at 18:20
  • This answer is fairly accurate. I tested on a Chinese-language machine and the default CSV (not the UTF-8 option) is saved as GB2312. Things are silently "defaulted" like this on Windows, with no notification. Btw, why isn't this the accepted answer? – Rick May 22 '18 at 07:44
18

CSV files could be in any encoding, depending on what encoding option was specified during the export from Excel: (Save Dialog, Tools Button, Web Options Item, Encoding Tab)

UPDATE: Excel (including Office 2013) doesn't actually respect the web options selected in the "save as..." dialog, so this is a bug of some sort. I just use OpenOffice Calc now to open my XLSX files and export them as CSV files (edit filter settings, choose UTF-8 encoding).

Triynko
  • 18,766
  • 21
  • 107
  • 173
  • 4
    and the default encoding is "ANSI", usually one of cp1250 to cp1258. – John Machin May 18 '10 at 20:43
  • My default encoding is UTF-8, but that's probably because I've selected it in the past and Excel is just remembering my selection from the last time I saved a CSV file. This selection seems to persist even after closing and re-launching Excel. – Triynko Oct 07 '11 at 20:50
  • 14
    The "Web Options" does not apply to CSV exports (at least with Excel 2007). The encoding of the CSV file seems to follow default encoding of the installation. So it can be anything, in practice. Sadly, there does not seem to be a way to control this at export-time. – gawi Nov 15 '11 at 17:55
  • 13
    Confirmed that the Web Options setting doesn't apply to CSV exports in Excel 2010 as well. Please downvote this answer, it is incorrect. – Russell Davis Feb 03 '12 at 21:33
  • 4
    Instead of down-voting the answer, which is technically correct, why don't we all file a bug report @Microsoft about this broken feature. The "encoding option [that] was specified during the export from Excel" is the one in your computer's Region and Language settings, rather than the encoding tab in the web options. The encoding tab option says "save this document as:", so it should apply to the document being saved, for any text-based format to which text-encoding applies. The workaround is to change your Region and Language settings, but it requires a restart, and still uses code-pages. – Triynko Feb 07 '12 at 18:17
  • 4
    Yet another reason I've switched to Open Office, which handles encoding correctly. When you select "Text CSV", you can check the "Edit filter settings" checkbox, and it will display a window that allows you to choose the Character Set (which defaults to Western Europe (Windows-1252)), the Field Delimiter (comma, semi-colon, colon, space, or tab), and the Text Delimiter (double quote or single quote). The contrast between how these two similar pieces of software handle a CSV export is mind-blowing. OpenOffice.org Calc destroys Excel. – Triynko Feb 07 '12 at 18:48
  • 2
    Until Microsoft gets their act together on this (still messed up in Excel 2013), I've switched to OpenOffice Calc for opening csv and dbf files that contain international characters. I work with a lot of mapping data, and OpenOffice deals very nicely with the data files. – Leta Feb 05 '14 at 07:17
  • That's exactly what I've switched to for exporting proper UTF-8 CSV files. I'm still planning to file that bug report for Excel (5 years later, lol). – Triynko Feb 20 '15 at 09:07
  • 1
    Probably the safest way is to save as 'Unicode Text'. This gives you a tab-delimited UTF-16 file. – besimple Mar 03 '15 at 23:33
  • Excel for Mac saves in Mac Roman encoding, an entirely different scheme that is now pretty much foreign to Mac, just to muck up everything. – jbarlow May 05 '15 at 08:34
  • This answer describes how to do this by saving as "Unicode Text", which gives you a UTF-16 TSV file that you can then convert in a text editor if required (but other tools may be able to read it directly as well): https://stackoverflow.com/a/6966518/1806841 – FelixD Dec 18 '18 at 09:16
  • In Excel 365 v1901 there is the option to save as `CSV UTF-8`. It worked for me. – Wesley Gonçalves Mar 01 '19 at 15:32
  • In Windows 10 (version 1803) there is a setting under Current system locale (Settings > Time & Language > Region & language > Administrative language settings > Change system locale...) "Beta: Use Unicode UTF-8 for worldwide language support" - enabling this causes Excel 2010 to correctly save CSV files encoded as UTF-8. – Jake Sep 20 '19 at 14:56
15

Waking up this old thread... We are now in 2017, and Excel is still unable to save a simple spreadsheet to CSV while preserving the original encoding... Just amazing.

Luckily Google Docs lives in the right century. The solution for me is just to open the spreadsheet using Google Docs, then download it back as CSV. The result is a correctly encoded CSV file (with all strings encoded in UTF-8).

Albert Godfrind
  • 1,978
  • 1
  • 12
  • 20
4

I had a similar problem last week. I received a number of CSV files with varying encodings. Before importing into the database, I used the chardet library to automatically sniff out the correct encoding.

Chardet is a port of Mozilla's character detection engine, and if the sample size is large enough (one accented character will not do), it works really well.
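
For a Java app like the one in the question, a similar option is juniversalchardet, a Java port of the same Mozilla detector. A rough sketch, assuming that library is on the classpath and using a placeholder file name:

import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class SniffCsvEncoding {
    public static void main(String[] args) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        byte[] buf = new byte[4096];
        try (FileInputStream fis = new FileInputStream("export.csv")) {
            int nread;
            // Feed bytes to the detector until it is confident or the file ends
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        // May be null if the sample is too small or too ambiguous
        System.out.println("Detected encoding: " + detector.getDetectedCharset());
    }
}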

pi.
  • 21,112
  • 8
  • 38
  • 59
  • chardet is helpful (I tested under GB2312) but the original link no longer works. Here's the latest GitHub link: https://github.com/chardet/chardet – Rick May 22 '18 at 09:01
2

The Russian edition offers CSV, CSV (Macintosh) and CSV (DOS).

When saving in plain CSV, it uses windows-1251.

I just tried to save the French word Résumé along with the Russian text; it was saved in hex as 52 3F 73 75 6D 3F, 3F being the ASCII code for the question mark.

When I opened the CSV file, the word had, of course, become unreadable (R?sum?).
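
This is exactly the substitution Java's own charset machinery performs when a character has no mapping in a single-byte code page; a small sketch reproducing the byte dump above:

import java.nio.charset.Charset;
import java.util.Arrays;

public class UnmappableDemo {
    public static void main(String[] args) {
        // 'é' has no mapping in windows-1251 (Cyrillic), so the encoder
        // substitutes the replacement byte 0x3F, i.e. '?'
        byte[] bytes = "Résumé".getBytes(Charset.forName("windows-1251"));
        System.out.println(Arrays.toString(bytes)); // [82, 63, 115, 117, 109, 63]
    }
}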

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • Same for me. I can only save characters that within my system language charset/encoding. Otherwise I have to save as a utf8 csv file. – Rick May 22 '18 at 07:48
0

Excel 2010 saves a UTF-16/UCS-2 TSV file if you select File > Save As > Unicode Text (.txt). The file is forcibly suffixed ".txt", which you can change to ".tsv".

If you need CSV, you can then convert the TSV file in a text editor like Notepad++, UltraEdit, Crimson Editor etc., replacing the tabs with semicolons, commas or the like. Note that, e.g. for reading into a DB table, TSV often works fine already (and it is often easier to read manually).

If you need a different code page like UTF-8, use one of the above-mentioned editors to convert the file.
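
If you would rather do the conversion from Java instead of an editor, a minimal sketch along these lines might work; the file names are placeholders, and the naive tab-to-semicolon swap assumes the cell values themselves contain no tabs, semicolons or quotes:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TsvToCsv {
    public static void main(String[] args) throws IOException {
        // Excel's "Unicode Text" export is UTF-16 with a BOM, which the UTF_16 charset honours
        try (BufferedReader in = Files.newBufferedReader(Paths.get("export.txt"), StandardCharsets.UTF_16);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("export.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Naive conversion: no quoting or escaping of field values
                out.write(line.replace('\t', ';'));
                out.newLine();
            }
        }
    }
}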

FelixD
  • 629
  • 7
  • 17
-1

cp1250 is used extensively in Microsoft Office documents, including Word and Excel 2003.

http://en.wikipedia.org/wiki/Windows-1250

A simple way to confirm this would be to:

  1. Create a spreadsheet with higher order characters, e.g. "Veszprém" in one of the cells;
  2. Use your favourite scripting language to parse and decode the spreadsheet;
  3. Look at what your script produces when you print out the decoded data.

Example Perl script:

#!perl

use strict;

use Spreadsheet::ParseExcel::Simple;
use Encode qw( decode );

my $file    = "my_spreadsheet.xls";

# Parse the workbook and take the first worksheet
my $xls     = Spreadsheet::ParseExcel::Simple->read( $file );
my $sheet   = [ $xls->sheets ]->[0];

# Walk the rows, decoding each cell from cp1250 before printing
while ($sheet->has_data) {

    my @data = $sheet->next_row;

    for my $datum ( @data ) {
        print decode( 'cp1250', $datum );
    }

}
robbiebow
  • 89
  • 1
  • 5
  • 5
    "cp1250 is used extensively in Microsoft Office documents" ... in YOUR neck of the woods. cp125n is used extensively world-wide, for 0 <= n <= 8. – John Machin May 18 '10 at 20:40
-1

You could use this Visual Studio VB.Net code to get the encoding:

' Note: StreamReader only sniffs a byte-order mark here; without a BOM it
' falls back to UTF-8, so CurrentEncoding is not a reliable detection.
Dim strEncodingName As String = String.Empty
Dim myStreamRdr As System.IO.StreamReader = New System.IO.StreamReader(myFileName, True)
Dim myString As String = myStreamRdr.ReadToEnd()
strEncodingName = myStreamRdr.CurrentEncoding.EncodingName
kttii
  • 107
  • 13
  • `StreamReader` doesn't detect the encoding, it [defaults to UTF-8](https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader?view=netframework-4.8#remarks). So this will always return UTF-8. – Gordon Leigh Jan 05 '20 at 09:12
-1

You can create the CSV file using UTF-8 encoding with a BOM (https://en.wikipedia.org/wiki/Byte_order_mark).

The first three bytes are the BOM (0xEF, 0xBB, 0xBF), followed by the UTF-8 content.
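
A minimal Java sketch of writing such a file; the file name and sample rows are placeholders:

import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteCsvWithBom {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Paths.get("export.csv"))) {
            // Write the UTF-8 byte-order mark first so Excel detects the encoding
            out.write(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                writer.write("Name;City\n");
                writer.write("Résumé;Veszprém\n");
            }
        }
    }
}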

-2

While it is true that exporting an Excel file that contains special characters to CSV can be a pain in the ass, there is, however, a simple workaround: simply copy/paste the cells into Google Docs and then save from there.

Joan-Diego Rodriguez
  • 2,439
  • 1
  • 27
  • 29
  • 5
    Sounds like they were looking for a programmatic solution. Copy and paste does not really "fit the bill". Also, this question is *quite* old. Best to avoid resurrecting old threads unless the response contributes something significant over previous responses. ) – Leigh Jun 15 '12 at 02:48
  • 1
    Well, if you say so. However the best rated answer doesn't work (see comments) and it is not a programmatic solution either. Moreover, my solution works. Finally, the topic might be old, it still pops up on the first page when googling the issue, and there is no working solution on the thread but mine. – Joan-Diego Rodriguez Jun 24 '12 at 14:45
  • Actually the comments say it is a bug. They also suggest two alternatives, obviously not ideal: OpenOffice or modifying the region and language settings. So I am not sure how this suggestion is a significant improvement over those already given. Now if there were a fix in later versions, or a programmatic alternative, I am sure others would love to hear it. – Leigh Jun 25 '12 at 19:49
  • If you want to convert programmatically, just pass the document through the `iconv` command. Simple and easy. `iconv` ports for Windows exist. See http://www.fileformat.info/tip/linux/iconv.htm – Craig Ringer Mar 21 '13 at 01:14
-3

OOXML files like those that come from Excel 2007 are encoded in UTF-8, according to Wikipedia. I don't know about CSV files, but it stands to reason they would use the same encoding...

rmeador
  • 25,504
  • 18
  • 62
  • 103