40

Possible Duplicate:
Excel to CSV with UTF8 encoding

Scenario: I have an excel file containing a large amount of global customer data. I do not know what encoding was used when the file was created.

Question: How can I determine the character encoding used in the excel file so I can import it correctly into another piece of software?

Community
  • 1
  • 1
samaspin
  • 2,342
  • 1
  • 26
  • 31
  • I guess that your problem is discussed and answered in http://superuser.com/questions/280603/how-to-set-character-encoding-when-opening-excel – Jüri Ruut Nov 05 '12 at 15:42
  • 4
    @JüriRuut Not really, this question is the other way around. And I'd like a canonical answer on this as well, so +1 to the question. – deceze Nov 05 '12 at 15:55
  • @deceze: then it would be "export data from Excel"? – Jüri Ruut Nov 05 '12 at 15:56
  • @JüriRuut I'm assuming he means "reading an .xls file using some library in some programming language". Then it all makes sense... Sam, correct this assumption if I'm wrong. – deceze Nov 05 '12 at 15:59
  • OK, then it's in this thread: http://stackoverflow.com/questions/4221176/excel-to-csv-with-utf8-encoding – Jüri Ruut Nov 05 '12 at 16:03
  • 3
    @deceze - you are spot-on! In order to import the file correctly I first need to know how it was originally encoded. If you import it and just assume a certain character set was used you could end up bad data - certain characters being lost or replaced with other characters unintentionally. – samaspin Nov 05 '12 at 16:52
  • @Jüri Ruut - thanks for trying but neither of those threads answer the question. – samaspin Nov 05 '12 at 16:53
  • Could be, but then the question should be little bit more detailed. Otherwise "another piece of instructions" could be fully sufficient for "large amount of global data" exported/imported by "another piece of software". – Jüri Ruut Nov 05 '12 at 19:16
  • I guess this spec should help here http://sc.openoffice.org/excelfileformat.pdf. See section `5.17 CODEPAGE`. – Dfr Jun 14 '13 at 15:04

1 Answers1

10

For Excel 2010 it should be UTF-8. Instruction by MS :
http://msdn.microsoft.com/en-us/library/bb507946:

"The basic document structure of a SpreadsheetML document consists of the Sheets and Sheet elements, which reference the worksheets in the Workbook. A separate XML file is created for each Worksheet. For example, the SpreadsheetML for a workbook that has two worksheets name MySheet1 and MySheet2 is located in the Workbook.xml file and is shown in the following code example.

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
<workbook xmlns=http://schemas.openxmlformats.org/spreadsheetml/2006/main xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <sheets>
        <sheet name="MySheet1" sheetId="1" r:id="rId1" /> 
        <sheet name="MySheet2" sheetId="2" r:id="rId2" /> 
    </sheets>
</workbook>

The worksheet XML files contain one or more block level elements such as SheetData. sheetData represents the cell table and contains one or more Row elements. A row contains one or more Cell elements. Each cell contains a CellValue element that represents the value of the cell. For example, the SpreadsheetML for the first worksheet in a workbook, that only has the value 100 in cell A1, is located in the Sheet1.xml file and is shown in the following code example.

<?xml version="1.0" encoding="UTF-8" ?> 
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <sheetData>
        <row r="1">
            <c r="A1">
                <v>100</v> 
            </c>
        </row>
    </sheetData>
</worksheet>

"

Detection of cell encodings:

https://metacpan.org/pod/Spreadsheet::ParseExcel::Cell

http://forums.asp.net/t/1608228.aspx/1

szabgab
  • 6,202
  • 11
  • 50
  • 64
Jüri Ruut
  • 2,500
  • 17
  • 20
  • 1
    how are you supposed to find these XML files for a given Excel file? – user5359531 Nov 02 '16 at 22:59
  • 1
    @user5359531: [http://stackoverflow.com/questions/8984254/how-to-view-the-xml-form-of-an-excel-file](http://stackoverflow.com/questions/8984254/how-to-view-the-xml-form-of-an-excel-file) – Scarabee Nov 05 '16 at 00:06
  • 1
    I am wondering if this is still an accurate way to determine the character encoding of an Excel sheet then, because I have a sheet containing international characters that are only supported by UTF-16, but the XML clearly labels it as `encoding="UTF-8"`. Is this encoding referring to something besides the text contained in the sheet? – user5359531 Nov 06 '16 at 00:00
  • 1
    @user5359531 "I have a sheet containing international characters that are only supported by UTF-16" - If I understand correctly, UTF-8 and UTF-16 (and UTF-32) all support all unicode characters, they just use a different encoding to do so. (UTF-8 uses 1, 2, 3, or 4 bytes, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes). – James Mar 28 '18 at 01:10