0

I need to process a lot of .xls files which come out of this Microscopy image analysis software called Aperio (after analysis with Aperio, it allows you to export the data as "read-only" xls format. The save-as only works in Excel on a Mac, on windows machine, the save and save as buttons are greyed out since the files are protected). Unfortunately, the header of these files are not standard OLE2 format. Therefore, they cannot be picked up with Java API POI unless they are manually loaded in Microsoft Excel and save as .xls one by one.

Since there are so many of them in the directory, it would be pretty painful to do the save-as by hand. Is there a way to write a Java program to automatically save these files as standard xls files? If it is impossible for Java, what other language can handle this situation, Python?

Edit:

I loaded one of the files in hex reader and here it is: 09 04 06 00 07 00 10 00 00 00 5C 00 04 00 05 4D 44 41 80 00 08 00 00 00 00 00 00 00 00 00 92 00 19 00 06 00 00 00 00 00 F0 F0 F0 00 00 00 00 00 FF FF FF 00 00 00 00 00 FF FF FF 0C 00 02 00 01 00 0D 00 02 00 64 00 0E 00 02 00 01 00 0F 00 02 00 01 00 11 00 02 00 00 00 22 00 02 00 00 00 2A 00 02 00 00 00 2B 00 02 00 00 00 25 02 04 00 00 00 FF 00 1F 00 02 00 22 00 1E 04 0A 00 00 00 07 47 65 6E 65 72 61 6C 1E 04 04 00 00 00 01 30 1E 04 07 00 00 00 04 30 2E 30 30 1E 04 08 00 00 00 05 23 2C 23 23 30 1E 04 0B 00 00 00 08 23 2C 23 23 30 2E 30 30 1E 04 18 00 00 00 15 23 2C 23 23 30 5F F0 5F 2E 3B 5C 2D 23 2C 23 23 30 5F F0 5F 2E 1E 04 1D 00 00 00 1A 23 2C 23 23 30 5F F0 5F 2E 3B 5B 52 65 64 5D 5C 2D 23 2C 23 23 30 5F F0 5F 2E 1E 04 1E 00 00 00 1B 23 2C 23 23 30 2E 30 30 5F F0 5F 2E 3B 5C 2D 23 2C 23 23 30 2E 30 30 5F F0 5F 2E 1E 04 23 00 00 00 20 23 2C 23 23 30 2E 30 30 5F F0 5F 2E 3B 5B 52 65 64 5D 5C 2D 23 2C 23 23 30 2E 30 30 5F F0 5F 2E 1E 04 18 00 00 00 15 23 2C 23 23 30 22 F0 2E 22 3B 5C 2D 23 2C 23 23 30 22 F0 2E 22 1E 04 1D 00 00 00 1A 23 2C 23 23 30 22 F0 2E 22 3B 5B 52 65 64 5D 5C 2D 23 2C 23 23 30 22 F0 2E 22 1E 04 1E 00 00 00 1B 23 2C 23 23 30 2E 30 30 22 F0 2E 22 3B 5C 2D 23 2C 23 23 30 2E 30 30 22 F0 2E 22 1E 04 23 00 00 00 20 23 2C 23 23 30 2E 30 30 22 F0 2E 22 3B 5B 52 65 64 5D 5C 2D 23 2C 23 23 30 2E 30 30 22 F0 2E 22 1E 04 05 00 00 00 02 30 25 1E 04 08 00 00 00 05 30 2E 30 30 25 1E 04 0B 00 00 00 08 30 2E 30 30 45 2B 30 30 1E 04 0A 00 00 00 07 23 22 20 22 3F 2F 3F 1E 04 09 00 00 00 06 23 22 20 22 3F 3F 1E 04 0D 00 00 00 0A 64 64 2F 6D 6D 2F 79 79 79 79 1E 04 0C 00 00 00 09 64 64 2F 6D 6D 6D 2F 79 79 1E 04 09 00 00 00 06 64 64 2F 6D 6D 6D 1E 04 09 00 00 00 06 6D 6D 6D 2F 79 79 1E 04 0E 00 00 00 0B 68 3A 6D 6D 5C 20 41 4D 2F 50 4D 1E 04 11 00 00 00 0E 68 3A 6D 6D 3A 73 73 5C 20 41 4D 2F 50 4D 1E 04 07 00 00 00 04 68 3A 6D 6D 1E 04 0A 00 00 00 07 68 3A 6D 6D 3A 73 73 1E 04 13 00 00 00 10 64 64 2F 6D 6D 2F 79 79 79 79 5C 20 68 3A 6D 6D 1E 04 0B 00 00 00 08 23 23 30 2E 30 45 2B 30 1E 04 08 00 00 00 05 6D 6D 3A 73 73 1E 04 04 00 00 00 01 40 1E 04 36 00 00 00 33 5F 2D 2A 20 23 2C 23 23 30 22 F0 2E 22 5F 2D 3B 5C 2D 2A 20 23 2C 23 23 30 22 F0 2E 22 5F 2D 3B 5F 2D 2A 20 22 2D 22 22 F0 2E 22 5F 2D 3B 5F 2D 40 5F 2D 1E 04 36 00 00 00 33 5F 2D 2A 20 23 2C 23 23 30 5F F0 5F 2E 5F 2D 3B 5C 2D 2A 20 23 2C 23 23 30 5F F0 5F 2E 5F 2D 3B 5F 2D 2A 20 22 2D 22 5F F0 5F 2E 5F 2D 3B 5F 2D 40 5F 2D 1E 04 3E 00 00 00 3B 5F 2D 2A 20 23 2C 23 23 30 2E 30 30 22 F0 2E 22 5F 2D 3B 5C 2D 2A 20 23 2C 23 23 30 2E 30 30 22 F0 2E 22 5F 2D 3B 5F 2D 2A 20 22 2D 22 3F 3F 22 F0 2E 22 5F 2D 3B 5F 2D 40 5F 2D 1E 04 3E 00 00 00 3B 5F 2D 2A 20 23 2C 23 23 30 2E 30 30 5F F0 5F 2E 5F 2D 3B 5C 2D 2A 20 23 2C 23 23 30 2E 30 30 5F F0 5F 2E 5F 2D 3B 5F 2D 2A 20 22 2D 22 3F 3F 5F F0 5F 2E 5F 2D 3B 5F 2D 40 5F 2D 31 00 14 00 A0 00 00 00 08 00 0D 4D 53 20 53 61 6E 73 20 53 65 72 69 66 31 00 14 00 A0 00 00 00 0E 00 0D 4D 53 20 53 61 6E 73 20 53 65 72 69 66 31 00

Nasreddin
  • 1,509
  • 9
  • 31
  • 36
  • What format are they? – SLaks Feb 20 '15 at 15:53
  • they are xls, but just not the right form that can be recognized by apache-poi. I get the following error if I do not save them in Excel first `org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature: read 0x0010000700060409, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document` – Nasreddin Feb 20 '15 at 15:56
  • Take one file, change its extension to .zip, see if can be unzipped. Let us know. – mbmast Feb 20 '15 at 16:00
  • Also, check if it's plain text. – SLaks Feb 20 '15 at 16:01
  • They are not plain text. Less does not open them (saying they are binaries) Excel opens them without any problem. I think the header's (meta data) format is a little weird. – Nasreddin Feb 20 '15 at 16:04
  • Tried renaming a file to zip. Can't be unziped "End-of-central directory signature not found. Either this file is not a zipfile or it constitutes one disk of a multi-part archive. – Nasreddin Feb 20 '15 at 16:06
  • Hmmmm. Do you have a hex editor? Can you dump the first 1k and post it? – mbmast Feb 20 '15 at 16:10
  • Those files are Excel 4. If you upgrade to a nightly build of Apache POI, you'll get a helpful exception, and you'll be able to extract the text from them – Gagravarr Feb 21 '15 at 11:37

3 Answers3

2

Are you opposed to just doing it with an Excel macro? VBA isn't very elegant, but it is great for all things related to Excel file manipulation. This would be the code to search a directory for .xls files, create a new directory, and resave those .xls files in the new directory:

Sub Resave_Files()

    ' Directory to search for .xls files.
    file_directory = "C:\Bobby\Temp"

    ' Create a temporary directory to save files.
    temp_directory = ActiveWorkbook.Path & "\Temp"
    If Dir(temp_directory, vbDirectory) = "" Then
         MkDir temp_directory
    End If

    ' Resave each .xls file in the temporary directory.
    Dim fs, f
    Set fs = CreateObject("Scripting.FileSystemObject")
    Set f = fs.GetFolder(file_directory)
    For Each f1 In f.Files
        If (Right(f1.Name, Len(".xls")) = ".xls") Then
            wb_name = ActiveWorkbook.Name
            Set data_wb = Workbooks.Open(f1.Name)
            New_Name = temp_directory + "\" + Left(f1.Name, InStr    (f1.Name, ".xls") - 1) + "_new.xlsx"
            data_wb.SaveAs Filename:=New_Name
            data_wb.Close SaveChanges:=False
        End If
    Next

End Sub
BigBobby
  • 423
  • 1
  • 3
  • 17
  • Well, it's definitely not an OLE Compound Document. Those must begin: `Header Signature (8 bytes): Identification signature for the compound file structure, and MUST be set to the value 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1.` as per Microsoft's MS-CFB Binary File Format. – mbmast Feb 20 '15 at 17:20
  • Doing a web search on 0x09 0x04 0x06 provides enough evidence to make me think this is an Excel 4.0 file. Pretty old. – mbmast Feb 20 '15 at 17:28
  • Take a look at this SO post: http://stackoverflow.com/questions/802299/reading-data-from-excel-file-prior-to-version-95 – mbmast Feb 20 '15 at 17:32
  • @BigBobby, not familiar with VBA at all. Do I have to open excel to run this code or it can be run by itself outside excel? – Nasreddin Feb 23 '15 at 15:40
  • You run it inside of Excel. To use the code I posted, you would 1) Enable the Developer Tab in Excel so that you can use macros (https://msdn.microsoft.com/en-us/library/bb608625.aspx). 2) Create a new spreadsheet and use SaveAs to make it macro enabled (it will have the .xlsm file extension instead of the .xlsx). 3) Click the Visual Basic button in the Developer tab to bring up the VBA development environment. 4) Copy/Paste the my code into the VBA editor. 5) Click the Run button. My macro will then open up all of your other .xls files and resave them using SaveAs in Excel. – BigBobby Feb 23 '15 at 16:17
1

Use JODConverter. You have an Excel 4.0 file; too old for Apache POI.

mbmast
  • 960
  • 11
  • 25
  • does JODConverter convert excel 4.0 to 97-2004 format so it'd be readable by Apache POI? – Nasreddin Feb 20 '15 at 20:21
  • It will convert to Open Office. You can also try JACOB, which allows you to control OLE apps, like Excel: http://sourceforge.net/projects/jacob-project/ – mbmast Feb 20 '15 at 21:12
  • Apache POI does support older Excel files like this (Excel 4), but only for extracting as plain text. You'll need to use at least POI 3.12 beta 1 though – Gagravarr Feb 21 '15 at 11:38
  • @mbmast, what is the best Java API for parsing ods format, reading in text by rows/cells? – Nasreddin Feb 23 '15 at 15:17
  • I have no experience working with OpenOffice. If it were me, I'd look first to Apache: http://www.openoffice.org/api/. – mbmast Feb 24 '15 at 17:59
0

Well, since you mentioned Python as a possibility, you could use xlrd to read the files (it claims support for very old Excel files). Writing them back out can be done with xlwt. For convenience, xlutils provides a copy facility to bridge xlrd and xlwt. All of these are described here. Notably, the PDF on that page is billed as a tutorial, but it also serves as the de facto manual for all three packages.

John Y
  • 14,123
  • 2
  • 48
  • 72