1

I need to read an excel(.xls) file that i'm receiving. Using the regular charsets like UTF-8, Cp1252, ISO-8859-1, UTF-16LE, none of these helped me, the characters are still malformed.

So i search ended up using juniversalchardet, it showed me that the charset was MacCyrillic, used MacCyrillic to read the file, but still the same weird outcome.

When i open the file on excel everything is fine, all the characters are fine, since its portuguese its filled whit Ç ~ and such. But opening whit notepad or trough java the file is all messed up. But if open the file on my excel and then save it again like .txt it becomes readable

My method to find the charset

    public static void lerCharset(String fileName) throws IOException {
    byte[] buf = new byte[50000000];
    FileInputStream fis = new FileInputStream(fileName);

    // (1)
    UniversalDetector detector = new UniversalDetector(null);

    // (2)
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    // (3)
    detector.dataEnd();

    // (4)
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
        System.out.println("Detected encoding = " + encoding);
    } else {
        System.out.println("No encoding detected.");
    }

    // (5)
    detector.reset();
    fis.close();
}

How can i discover the correct charset? Should i try a different aproach? Like making my java re-save the excel and then start reading?

Community
  • 1
  • 1
prabello
  • 556
  • 2
  • 14
  • 31

2 Answers2

3

If I'm understanding your question, you're trying to read the excel file like a text file.

The challenge is that .xls files are actually binary files containing the text, formatting, sheet information, macro information, etc...

You'd either need to save the files as .csv (Either via Excel before running your program or through your program directly), upgrade them to .xlsx (which has numerous libraries that can read the file as an XML at that point) or use a library (such as apache POI or anything similar) or even query the data out using ADO.

Good luck and I hope that's what you were implying via your question.

Community
  • 1
  • 1
John Bustos
  • 19,036
  • 17
  • 89
  • 151
  • Can i use apache POI to save this file as a .csv? This excel is not formatted, it has 2 columns and the definitions are made on the line, really confusing but that's something i will have to deal – prabello Oct 01 '14 at 19:20
  • What do you mean "definitions are made on the line"? – John Bustos Oct 01 '14 at 19:21
  • the file has column A and B, inside the column A i have the "definitions" which should be columns and on B the values. But i have like COLUM A ROW 14 -> DESCRIPTION COLUM B ROW 14 -> XXXX COLUM B ROW 15 -> CONTINUATION OF XXXX – prabello Oct 01 '14 at 19:22
  • If I were you, I'd just either use Apache to open the file and take the data out or just query the file directly (here's one link that can show you how to do that: http://www.javaprogrammingforums.com/jdbc-database-tutorials/356-how-connect-excel-spreadsheet-using-jdbc-java.html ) or look online for other libraries... It truly depends on what you're trying to accomplish... Give it a try and come back and ask questions if you get stuck.... Just do some looking around... http://stackoverflow.com/questions/2837039/how-can-i-programatically-convert-xls-and-csv-files-to-xlsx – John Bustos Oct 01 '14 at 19:27
-1

Code:

WorkbookSettings workbookSettings = new WorkbookSettings();
WorkbookSettings.setEncoding("Cp1252");
Community
  • 1
  • 1
Mr. Mascaro
  • 2,693
  • 1
  • 11
  • 18