Character encoding in Excel spreadsheet (and what Java charset to use to decode it)

Question

I am using the JExcel library to read Excel spreadsheets. Each cell in the spreadsheet may contain localization strings in any of about 44 languages (English, Portuguese, French, Chinese, etc.). Today I don't tell the API anything about the encoding it's supposed to use. It handles the Chinese OK, but it always screws up Portuguese and German. Somehow the default encoding (MacRoman on my dev box, UTF-8 on production) is failing to properly interpret the strings it pulls out of the Excel workbook. There has to be something wrong with how JExcel is interpreting the character encoding of the file.

That being said...

Are all the strings in an Excel workbook encoded with the same character set?

Is there workbook metadata I can query to find out what this character set is (I haven't found it yet)?

If I run all the cells through something like jchardet (http://jchardet.sourceforge.net/), is it likely to be able to divine the character encoding for the whole workbook (this is pretty much predicated on the answer to the first question being "yes, all strings in a given workbook are encoded with the same character set")?

So many questions, so little time.

`.xlsx` files are really just XML files, which (I would _think_) means that there's only a single encoding for the whole file. `.xls`, on the other hand is (again, I _think_) a binary format, so I'm not sure if each cell could have its own character encoding... — Matt Ball, Sep 16 '11 at 19:20
I think you are right, Matt. XLS is a binary format. I've also just had an "oh crap" moment in my logic above: the JExcel API requires me to set the workbook encoding before I parse. I was thinking I could parse to figure out the encoding. Rock. Hard Place. — Bob Kuhar, Sep 16 '11 at 19:22
Found the XLS encoding spec, thanks to Wikipedia: http://sc.openoffice.org/excelfileformat.pdf — Matt Ball, Sep 16 '11 at 19:24

score 10 · Accepted Answer · answered Sep 17 '11 at 01:05

Well I didn't get an answer directly, but Matt's discovery of a spec points the way towards an actual answer: http://sc.openoffice.org/excelfileformat.pdf

In the meantime, my problem went away by just setting the encoding to always be "Cp1252". I'm not sure exactly why, but I'm not looking a gift horse in the mouth, so to speak, and am moving on.

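    // needs: import jxl.Workbook; import jxl.WorkbookSettings;
    WorkbookSettings workbookSettings = new WorkbookSettings();
    workbookSettings.setEncoding( "Cp1252" );  // Cp1252 = Windows-1252
    Workbook workbook = Workbook.getWorkbook( theFile, workbookSettings );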
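
One plausible explanation, assuming the accented strings are stored in the workbook as single-byte Windows-1252 text (an assumption; it has not been checked against the spec): the same bytes decode to different characters depending on the charset used. The sketch below uses only the JDK, not JExcel, and the `DecodeDemo` class and `decode` helper are made up for illustration.

```java
import java.nio.charset.Charset;

public class DecodeDemo {
    // Decode the same raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Hypothetical cell content: c-cedilla (0xE7), a-tilde (0xE3), 'o' (0x6F),
        // as it would look if stored as Windows-1252 bytes.
        byte[] raw = { (byte) 0xE7, (byte) 0xE3, (byte) 0x6F };

        // Decoded with the charset the bytes were written in: the text survives.
        System.out.println(decode(raw, "Cp1252"));

        // Decoded as UTF-8: 0xE7 and 0xE3 are not valid UTF-8 here, so the
        // decoder substitutes the replacement character U+FFFD for them.
        System.out.println(decode(raw, "UTF-8"));
    }
}
```

What the console actually prints depends on the terminal's own encoding, so compare the strings in code rather than by eye when checking this.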

I'll call this one answered.

Bob Kuhar

http://stackoverflow.com/questions/508558/what-charset-does-microsoft-excel-use-when-saving-files could also have additional information here. – VonC Aug 31 '12 at 15:01
Your answer saved hours with my PHP program...thanks – Asad Hasan Feb 28 '14 at 04:48

score 1 · Answer 2 · answered Jun 19 '13 at 14:34

I have a problem where, when reading cell values from the Excel file, some values come out with "?" in place of accented letters. Would that code resolve this issue? I'm running under Windows, so I can't test it as quickly as I could under Linux (which is the OS of the server I'm deploying to).
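
Whether the encoding setting helps depends on where the "?" comes from. One common source in Java, shown in the sketch below, is encoding a string into a charset that cannot represent the character: `String.getBytes` then substitutes the charset's replacement byte, which is `?` for US-ASCII. This is only an illustration using the JDK; the `QuestionMarkDemo` class and `roundTrip` helper are hypothetical, and the cause in any particular program may be different (for example, the console's encoding).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced while encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "Stra\u00DFe"; // German "Strasse" written with sharp s (U+00DF)

        // US-ASCII has no sharp s, so it is replaced with '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so the text is unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

If the string is already correct in memory and only looks wrong when printed or written out, the fix belongs on the output side rather than in the workbook settings.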
