
I have a 150MB one-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)

Is there any way to open the Excel file more quickly? I'm open to even very outlandish suggestions (such as Hadoop, Spark, C, Java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds, if that's not a pipe dream. Also, the above example uses Python, but it doesn't have to be Python.


Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it; it is not our file.


UPDATE: An answer with a working code example that opens the following 200MB Excel file in under 30 seconds will be rewarded with the bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file has string (col 1), date (col 9), and number (col 11) data.

David542
    any reason why you're holding that much data in excel and not in a database? – twoleggedhorse Sep 22 '17 at 22:17
    @twoleggedhorse this is provided to us by a client. We don't have control of the input(s). – David542 Sep 22 '17 at 22:18
  • Please check this answer https://stackoverflow.com/a/13177797/5790398. It might be useful in your case. I don't know similar solution in Python. – eg04lt3r Sep 22 '17 at 22:20
    I would set up an ssis package (import export wizard will create one) in sql server and automate the import into a database. It *should* be quicker to import to db than to open in excel. At this point the world's your oyster with the data. – twoleggedhorse Sep 22 '17 at 22:21
    Save your Excel file as CSV and a lot of options open up. Hadoop is fine, but unnecessary. Try Pandas for Python. If you only store homogeneous data types, try HDF5, which uses NumPy in Python; dunno what other platforms use. Pandas can read Excel files, but I don't know how fast that method is. Last but not least, you can convert CSV files to SQLite databases and work with those. Edit: Does it have to be an actual Excel file, or can you change the format? – Telefonmann Sep 22 '17 at 22:24
  • @Telefonmann it's an excel file. It's not our file and we don't have any control over it. – David542 Sep 22 '17 at 22:54
  • @David542 Then try Pandas pandas.read_excel (http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_excel.html). However, you might want to try to convince your client to save the files as CSV. CSV is supported by excel, so there is no real excuse to save it as XLSX. – Telefonmann Sep 22 '17 at 23:16
    A 150MB XL workbook is not that large these days - to open efficiently in XL, first set Calculation to Manual and ForceFullCalculation to True - then if you want to use Python, follow the suggestion of saving it as CSV. The whole process should be a lot faster than 7 minutes on a powerful multi-core machine. – Charles Williams Sep 23 '17 at 07:17
    At the end of the day an Excel file is a folder containing a load of xml docs. You could extract the xml from the file and parse that instead? [link](http://professor-excel.com/xml-zip-excel-file-structure/) – Tom Sep 23 '17 at 11:06
    How long does it take Excel to open it? Just for reference. – xiaofeng.li Sep 26 '17 at 01:10
  • @LukeLee on mac, not able to open it. On PC, maybe about a minute, but it's quite sluggish to use. – David542 Sep 26 '17 at 02:10
  • @David542 Would you allow a VBA solution? – Bluebird Sep 26 '17 at 08:16
    I think it is also important to know what you want to do with the data once it is loaded. The method used to load the file could be suboptimal in terms of raw disk performance but better in terms of how you access the data after loading it. – Steve Sep 26 '17 at 13:00
  • @Steve let's say parse the file to load certain rows into a database. – David542 Sep 26 '17 at 19:28
    @David542: could you be more precise: for example, which rows of the 200MB Excel file you provide would be inserted into the database? What would be the criteria? An efficient approach would be to implement the selection as a filter written in any fast language; the resulting selection in CSV format could be loaded quickly and handled by your code. – chqrlie Sep 26 '17 at 20:05
  • @chqrlie every single row would be inserted -- we would just be manipulating the data (e.g., type conversion) before inserting it into our database with the selected columns. – David542 Sep 26 '17 at 21:18
    If you are going to insert all rows, the bottleneck is likely the database interface. The file is about 1GB of uncompressed data; how long does the insertion phase take? – chqrlie Sep 26 '17 at 21:24
  • @chqrlie that's fine -- the main part here is optimizing the opening of the file, etc. It's possible we'll save it to a csv or something else after doing what we need with loading it and then doing some processing on the data. – David542 Sep 26 '17 at 21:30
    xls files are binary; xlsx files are just a zip of XML files, as Tom mentioned. Extracting and then parsing the XML might be the fastest way. – athos Sep 27 '17 at 04:28
  • @athos want to test that in an answer? – David542 Sep 27 '17 at 04:57
    I'd like to do that once I have a solid result. I need to download the sample file after I get around the firewall. – athos Sep 27 '17 at 06:22
  • Just curious, how long will it take to load text/csv file of the same size? – ASpirin Sep 27 '17 at 08:31
  • @ASpirin what do you mean load? I would just iterate it line-by-line. – David542 Sep 28 '17 at 02:00
    @David542, On my machine your Python program runs in 3 minutes, and it's not a strong laptop... I'm wondering - how long does it take you to convert the file to CSV using either of these methods? xmodulo.com/how-to-convert-xlsx-files-to-xls-or-csv.html – user1708860 Sep 29 '17 at 22:48
    For this problem, I prefer to index it explicitly. Given the performance requirements, you can build a nested tree and access the metadata instead of the real data. Take a look at Google's Bigtable concept (the theory, not the implementation). I hope it helps. (I can imagine it might be done in a second.) – Amiri Sep 30 '17 at 08:18
  • @Martin cool -- any interest in posting an answer showing how it would be done? – David542 Sep 30 '17 at 09:10
    @David542, unfortunately I don't have real-world experience, but theoretically it might work. I hope some experts read my comment and provide a solution or correct it. – Amiri Sep 30 '17 at 12:05
  • @Martin if someone could open the file in 10s or under I would jump for joy -- and of course, select that as the answer. Want to look at your idea and see if it's possible for you to implement in an answer? – David542 Sep 30 '17 at 18:38
    Dear @David542, I only suggested a possible theoretical solution. My programming knowledge is not sufficient to implement it. – Amiri Sep 30 '17 at 21:00
    @David542, what do you do after opening the file? Loading the file in less than 30 seconds is one thing, but what needs to be done afterwards matters too. – Tarun Lalwani Oct 02 '17 at 18:09

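Several of the comments above boil down to the same observation: an .xlsx file is just a zip archive of XML parts, so you can bypass Excel libraries entirely and read those parts yourself. A minimal, stdlib-only Python sketch of that idea (the archive here is built in memory so the snippet is self-contained; with a real workbook you would pass its path to `ZipFile` instead):

```python
import io
import zipfile

# Build a tiny stand-in archive with the two parts that matter for a
# one-sheet workbook; a real .xlsx contains these same entry names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    zf.writestr("xl/sharedStrings.xml", "<sst/>")
buf.seek(0)

# Open the "workbook" as a plain zip and pull out the XML directly.
with zipfile.ZipFile(buf) as zf:
    parts = zf.namelist()  # the XML parts inside the .xlsx
    sheet_xml = zf.read("xl/worksheets/sheet1.xml").decode("utf-8")

print(parts)
print(sheet_xml)
```

Parsing those XML strings is then a plain XML problem, which is what the answers below build on.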
11 Answers


Most programming languages that work with Office products go through some middle layer, and that is usually where the bottleneck is; good examples are the PIAs/Interop and the Open XML SDK.

One way to get the data at a lower level (bypassing the middle layer) is using a Driver.

150MB one-sheet excel file that takes about 7 minutes.

The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:

Stopwatch sw = new Stopwatch();
sw.Start();

DataSet excelDataSet = new DataSet();

string filePath = @"c:\temp\BigBook.xlsx";

// For .xlsx use Provider=Microsoft.ACE.OLEDB.12.0 with Extended Properties="Excel 12.0;HDR=YES;";
// for .xls use Microsoft.Jet.OLEDB.4.0 with Extended Properties="Excel 8.0;HDR=YES;"
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";

using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
    ("select * from [Sheet1$]", conn);
    objDA.Fill(excelDataSet);
    //dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX took: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);
        


Win 7 x64, Intel i5 2.3GHz, 8GB RAM, 250GB SSD.

If I may recommend a hardware solution as well: if you're using a standard HDD, try an SSD.

Note: I can't download your Excel spreadsheet example as I'm behind a corporate firewall.

PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data; the consensus there is that OleDB is the fastest.

PS 2. Here's how you can do it with Python: http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

Jeremy Thompson

I managed to read the file in about 30 seconds using .NET Core and the Open XML SDK.

The following example returns a list of objects containing all rows and cells with their matching types; it supports date, numeric, and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (it should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).

public List<List<object>> ParseSpreadSheet()
{
    List<List<object>> rows = new List<List<object>>();

    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

        Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();

        int i = 0;
        foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
        {
            sharedStringCache.Add(i++, el.InnerText);
        }

        while (reader.Read())
        {
            if(reader.ElementType == typeof(Row))
            {
                reader.ReadFirstChild();

                List<object> cells = new List<object>();

                do
                {
                    if (reader.ElementType == typeof(Cell))
                    {
                        Cell c = (Cell)reader.LoadCurrentElement();

                        if (c == null || c.DataType == null || !c.DataType.HasValue)
                            continue;

                        object value;

                        switch(c.DataType.Value)
                        {
                            case CellValues.Boolean:
                                value = bool.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Date:
                                value = DateTime.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.Number:
                                value = double.Parse(c.CellValue.InnerText);
                                break;
                            case CellValues.InlineString:
                            case CellValues.String:
                                value = c.CellValue.InnerText;
                                break;
                            case CellValues.SharedString:
                                value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                break;
                            default:
                                continue;
                        }

                        if (value != null)
                            cells.Add(value);
                    }

                } while (reader.ReadNextSibling());

                if (cells.Any())
                    rows.Add(cells);
            }
        }
    }

    return rows;
}

I ran the program on a three-year-old laptop with an SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores), on Windows 10 64-bit.

Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using typed objects as in the example from my last edit, the time goes up to almost 50 seconds on my crappy laptop. You will probably get closer to 30 seconds on your server with Linux.

The trick was to use the SAX approach as explained here:

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
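For reference, the same streaming (SAX-style) idea can be sketched in Python with the standard library's `xml.etree.ElementTree.iterparse`. This is my own sketch, not code from the linked article: the sheet fragment below is simplified (real sheet1.xml parts carry namespaces and more attributes), and the `shared` list stands in for the workbook's shared-string table.

```python
import io
import xml.etree.ElementTree as ET

# Simplified sheet1.xml fragment: t="s" marks a shared-string index.
SHEET = b"""<worksheet><sheetData>
<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1"><v>3.14</v></c></row>
<row r="2"><c r="A2" t="s"><v>1</v></c><c r="B2"><v>42</v></c></row>
</sheetData></worksheet>"""

shared = ["hello", "world"]  # stand-in for xl/sharedStrings.xml

rows = []
cells = None
for event, el in ET.iterparse(io.BytesIO(SHEET), events=("start", "end")):
    if event == "start" and el.tag == "row":
        cells = []
    elif event == "end" and el.tag == "c":
        v = el.findtext("v")
        # shared-string cells hold an index; everything else is numeric here
        cells.append(shared[int(v)] if el.get("t") == "s" else float(v))
        el.clear()  # discard the element as we go: this keeps memory flat
    elif event == "end" and el.tag == "row":
        rows.append(cells)

print(rows)
```

The `el.clear()` call is the whole point of the SAX approach: the document is never held in memory as a full tree, which is what makes huge sheets tractable.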

Isma
  • very neat approach, thank you. What do you mean by using `objects` ? Is this a non-string, or what? – David542 Sep 28 '17 at 17:55
  • No problem, this is fun stuff ;) The value from each cell of the Excel file (which is stored as XML text) is converted to an object according to the data type of the cell, so you end up with a list of lists of objects. The outer list contains the rows, and each row contains an inner list with its cells. This is a bit slower, but it will pay off when you want to do something with the data and save it to the database. For example, you could use Entity Framework bulk import to import all objects at once. – Isma Sep 28 '17 at 18:33
    This is likely the fastest possible, without unzipping and parsing the XML manually, which will be a lot of code, and difficult to maintain. – MineR Sep 30 '17 at 08:31
  • I think so too, yet people seem to be leaning towards the OleDb solution which is at least 50% slower and IMHO less clean. – Isma Sep 30 '17 at 09:25
    @Isma `case CellValues.Date:` is an Office 2010-only data type. Regular dates are stored with no data type on the cell, just a double value to convert as an OADate with a 1900 or 1904 epoch. Phew. You'll need to check the StyleIndex of the cells to format a date or number (or percent etc.), so there's a lot of work in fact. – Kul-Tigin Sep 30 '17 at 10:58
    @Isma one last thing. Empty cells are not present in the XML document; you also need to handle this. Looking at cell references in the loop would help detect missing ones (jumping from cell A1 to D1 gives a clue), so you can add empty items to your `cells` list. Ultimately this is not an easy format to handle; there are too many tricky parts like these. – Kul-Tigin Sep 30 '17 at 12:11

Well, if your Excel file is going to be as simple as a CSV file, like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip file and read every XML entry directly:

Intel i5-4460, 12GB RAM, Samsung EVO PRO SSD.

If you have a lot of RAM: this code needs a lot of memory, but it takes 20~25 seconds. (You need the JVM parameter -Xmx7g.)

package com.devsaki.opensimpleexcel;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Multithread {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);
        long init = System.currentTimeMillis();
        ExecutorService executor = Executors.newFixedThreadPool(4);
        char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
        char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));

        Object[][] sheet = futureSheet1.get();
        String[] words = futureWords.get();

        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("only read: " + (end - init) / 1000);

        // Do something with the file: save it as CSV
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    // Excel stores dates as a serial number of days counted from 1900-01-01
    // (the epoch is shifted back 2 days above), so add the serial to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static String readEntry(ZipFile zipFile, String entry) throws IOException {
        System.out.println("Initialing readEntry " + entry);
        long init = System.currentTimeMillis();
        String result = null;

        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            br.readLine();
            result = br.readLine();
        }

        long end = System.currentTimeMillis();
        System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
        return result;
    }


    public static String[] processSharedStrings(CharReader br) throws IOException {
        System.out.println("Initialing processSharedStrings");
        long init = System.currentTimeMillis();
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        String uniqueCount = extractNextValue(br, wordCount, '"');
        words = new String[Integer.parseInt(uniqueCount)];
        String nextWord;
        int currentIndex = 0;
        while ((nextWord = extractNextValue(br, token, '<')) != null) {
            words[currentIndex++] = nextWord;
            br.skip(11); //you can skip at least 11 chars "/t></si><si>"
        }
        long end = System.currentTimeMillis();
        System.out.println("SharedStrings: " + (end - init) / 1000);
        return words;
    }


    public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
        System.out.println("Initialing processSheet1");
        long init = System.currentTimeMillis();
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        String dimension = extractNextValue(br, dimensionToken, '"');
        int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
        br.skip(30); //Between dimension and next tag c exists more or less 30 chars
        Object[][] result = new Object[sizes[0]][sizes[1]];

        int parallelProcess = 8;
        int currentIndex = br.currentIndex;
        CharReader[] charReaders = new CharReader[parallelProcess];
        int totalChars = Math.round(br.chars.length / parallelProcess);
        for (int i = 0; i < parallelProcess; i++) {
            int endIndex = currentIndex + totalChars;
            charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
            currentIndex = endIndex;
        }
        Future[] futures = new Future[parallelProcess];
        for (int i = charReaders.length - 1; i >= 0; i--) {
            final int j = i;
            futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
        }
        for (Future future : futures) {
            future.get();
        }

        long end = System.currentTimeMillis();
        System.out.println("Sheet1: " + (end - init) / 1000);
        return result;
    }

    public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
        System.out.println("Initialing inParallelProcess : " + br.identifier);

        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();

        String v;
        int firstCurrentIndex = br.currentIndex;
        boolean first = true;

        while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
            if (first && back != null) {
                int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
                first = false;
                System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
                back.plusLength(sum);
            }
            int[] indexes = extractSizeFromDimention(v);

            int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
            char type = 's'; //3 types: number (n), string (s) and date (d)
            if (s == 0) { // Token S = number or date
                char read = br.read();
                if (read == '1') {
                    type = 'n';
                } else {
                    type = 'd';
                }
            } else if (s == -1) {
                type = 'n';
            }
            String c = extractNextValue(br, tokenOpenV, '<');
            Object value = null;
            switch (type) {
                case 'n':
                    value = Double.parseDouble(c);
                    break;
                case 's':
                    try {
                        value = Integer.parseInt(c);
                    } catch (Exception ex) {
                        System.out.println("Identifier Error : " + br.identifier);
                    }
                    break;
                case 'd':
                    value = c.toString();
                    break;
            }
            result[indexes[0] - 1][indexes[1] - 1] = value;
            br.skip(7); ///v></c>
        }
    }

    static class CharReader {
        char[] chars;
        int currentIndex;
        int length;

        int identifier;

        public CharReader(char[] chars) {
            this.chars = chars;
            this.length = chars.length;
        }

        public CharReader(char[] chars, int currentIndex, int length, int identifier) {
            this.chars = chars;
            this.currentIndex = currentIndex;
            if (length > chars.length) {
                this.length = chars.length;
            } else {
                this.length = length;
            }
            this.identifier = identifier;
        }

        public void plusLength(int n) {
            if (this.length + n <= chars.length) {
                this.length += n;
            }
        }

        public char read() {
            if (currentIndex >= length) {
                return CHAR_END;
            }
            return chars[currentIndex++];
        }

        public void skip(int n) {
            currentIndex += n;
        }
    }


    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(CharReader br, char[] token, char until) {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }
}
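The two small conversions buried in the code above (Excel's 1900-based date serials, and A1-style column letters as handled by `convertExcelIndex`) can be sketched in Python like this; using 1899-12-30 as the epoch is the same trick as the `plusDays(-2)` above, and is correct for serials past Excel's bogus 1900-02-29:

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial):
    # Excel counts days from 1900-01-01 but wrongly treats 1900 as a leap
    # year, so 1899-12-30 is the effective epoch for modern serial values.
    return datetime(1899, 12, 30) + timedelta(days=serial)

def column_index(letters):
    # "A" -> 1, "Z" -> 26, "AA" -> 27: base-26 with letters as digits 1..26
    result = 0
    for c in letters:
        result = result * 26 + (ord(c) - ord("A") + 1)
    return result

print(excel_serial_to_datetime(25569))  # serial 25569 is 1970-01-01
print(column_index("AA"))               # 27
```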

Old answer (does not need the -Xmx7g parameter, so it uses less memory): it takes about 35 seconds to open and read the example file (200MB) with an HDD; with an SSD it takes a little less (about 30 seconds).

Here the code: https://github.com/csaki/OpenSimpleExcelFast.git

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;

public class Launcher {

    public static final char CHAR_END = (char) -1;

    public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
        long init = System.currentTimeMillis();
        String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
        ZipFile zipFile = new ZipFile(excelFile);

        ExecutorService executor = Executors.newFixedThreadPool(4);
        Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
        Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
        String[] words = futureWords.get();
        Object[][] sheet1 = futureSheet1.get();
        executor.shutdown();

        long end = System.currentTimeMillis();
        System.out.println("Main only open and read: " + (end - init) / 1000);


        // Do something with the file: save it as CSV
        init = System.currentTimeMillis();
        try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
            for (Object[] rows : sheet1) {
                for (Object cell : rows) {
                    if (cell != null) {
                        if (cell instanceof Integer) {
                            writer.append(words[(Integer) cell]);
                        } else if (cell instanceof String) {
                            writer.append(toDate(Double.parseDouble(cell.toString())));
                        } else {
                            writer.append(cell.toString()); //Probably a number
                        }
                    }
                    writer.append(";");
                }
                writer.append("\n");
            }
        }
        end = System.currentTimeMillis();
        System.out.println("Main saving to csv: " + (end - init) / 1000);
    }

    private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
    private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);

    // Excel stores dates as a serial number of days counted from 1900-01-01
    // (the epoch is shifted back 2 days above), so add the serial to that date
    public static String toDate(double s) {
        return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
    }

    public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
        String entry = "xl/worksheets/sheet1.xml";
        Object[][] result = null;
        char[] dimensionToken = "dimension ref=\"".toCharArray();
        char[] tokenOpenC = "<c r=\"".toCharArray();
        char[] tokenOpenV = "<v>".toCharArray();

        char[] tokenAttributS = " s=\"".toCharArray();
        char[] tokenAttributT = " t=\"".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String dimension = extractNextValue(br, dimensionToken, '"');
            int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
            br.skip(30); //Between dimension and next tag c exists more or less 30 chars
            result = new Object[sizes[0]][sizes[1]];
            String v;
            while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
                int[] indexes = extractSizeFromDimention(v);

                int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
                char type = 's'; //3 types: number (n), string (s) and date (d)
                if (s == 0) { // Token S = number or date
                    char read = (char) br.read();
                    if (read == '1') {
                        type = 'n';
                    } else {
                        type = 'd';
                    }
                } else if (s == -1) {
                    type = 'n';
                }
                String c = extractNextValue(br, tokenOpenV, '<');
                Object value = null;
                switch (type) {
                    case 'n':
                        value = Double.parseDouble(c);
                        break;
                    case 's':
                        value = Integer.parseInt(c);
                        break;
                    case 'd':
                        value = c.toString();
                        break;
                }
                result[indexes[0] - 1][indexes[1] - 1] = value;
                br.skip(7); ///v></c>
            }
        }
        return result;
    }

    public static int[] extractSizeFromDimention(String dimention) {
        StringBuilder sb = new StringBuilder();
        int columns = 0;
        int rows = 0;
        for (char c : dimention.toCharArray()) {
            if (columns == 0) {
                if (Character.isDigit(c)) {
                    columns = convertExcelIndex(sb.toString());
                    sb = new StringBuilder();
                }
            }
            sb.append(c);
        }
        rows = Integer.parseInt(sb.toString());
        return new int[]{rows, columns};
    }

    public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
        String entry = "xl/sharedStrings.xml";
        String[] words = null;
        char[] wordCount = "Count=\"".toCharArray();
        char[] token = "<t>".toCharArray();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
            String uniqueCount = extractNextValue(br, wordCount, '"');
            words = new String[Integer.parseInt(uniqueCount)];
            String nextWord;
            int currentIndex = 0;
            while ((nextWord = extractNextValue(br, token, '<')) != null) {
                words[currentIndex++] = nextWord;
                br.skip(11); //you can skip at least 11 chars "/t></si><si>"
            }
        }
        return words;
    }

    public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
        char character;
        int[] indexes = new int[tokens.length];
        while ((character = (char) br.read()) != CHAR_END) {
            if (character == until) {
                break;
            }
            for (int i = 0; i < indexes.length; i++) {
                if (tokens[i][indexes[i]] == character) {
                    indexes[i]++;
                    if (indexes[i] == tokens[i].length) {
                        return i;
                    }
                } else {
                    indexes[i] = 0;
                }
            }
        }

        return -1;
    }

    public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
        char character;
        StringBuilder sb = new StringBuilder();
        int index = 0;

        while ((character = (char) br.read()) != CHAR_END) {
            if (index == token.length) {
                if (character == until) {
                    return sb.toString();
                } else {
                    sb.append(character);
                }
            } else {
                if (token[index] == character) {
                    index++;
                } else {
                    index = 0;
                }
            }
        }
        return null;
    }

    public static int convertExcelIndex(String index) {
        int result = 0;
        for (char c : index.toCharArray()) {
            result = result * 26 + ((int) c - (int) 'A' + 1);
        }
        return result;
    }

}
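For Python readers, the two generic helpers above translate almost line for line. A rough sketch — these function names are my own, not part of the linked repo:

```python
import io

def convert_excel_index(index):
    """Interpret an A1-style column label ('A'..'Z', 'AA', ...) as a
    base-26 number, mirroring convertExcelIndex above."""
    result = 0
    for c in index:
        result = result * 26 + (ord(c) - ord('A') + 1)
    return result

def extract_next_value(stream, token, until):
    """Scan `stream` character by character until `token` has been seen,
    then collect everything up to (but not including) `until`.
    Returns None if the token never appears (same as the Java version)."""
    matched = 0
    out = []
    while True:
        ch = stream.read(1)
        if not ch:  # end of stream
            return None
        if matched == len(token):
            if ch == until:
                return ''.join(out)
            out.append(ch)
        elif token[matched] == ch:
            matched += 1
        else:
            matched = 0  # naive reset, faithful to the Java original
```

Note that pure Python will be far slower than the Java original for a 200MB sheet; this is only to make the scanning logic easy to follow.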
devsaki
  • 751
  • 3
  • 10
  • Thanks, I'll test this out -- want to put the GitHub code for the main file, Launcher.java, in the answer itself? I think that might be more helpful for future people that may want to view it. – David542 Sep 26 '17 at 19:24
  • Well, I check what happens when the cell is not a string and the code is not going to work, but implementing that features is not hard (only check what the attribute t is in c tag. "s" => string). The cell styles shouldn't break the code. – devsaki Sep 26 '17 at 21:21
  • got it -- I would think the three types we'll need are (1) string; (2) number; and (3) date. I know Excel stores a date as a number but we'd need to be able to know that a cell is a date somehow. What does the timing take allowing (2) and (3)? – David542 Sep 26 '17 at 21:28
  • Ok, I added suppport to date and numbers – devsaki Sep 26 '17 at 23:21
  • great, I'll also update the question/file so that it includes numeric and date fields so that it can be tested easily. – David542 Sep 26 '17 at 23:56
  • The updated file doesn't have dates. But I already updated my answer supporting numbers and date types, didn't work? I tested with your new file and it's working here. – devsaki Sep 27 '17 at 03:28
  • I added another solution, it takes less time, but it needs a lot of heap memory. – devsaki Oct 02 '17 at 23:59
6

Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().

One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)

Column types
The types for your columns can be specified by passing dtype, converters, and parse_dates:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

Also add dayfirst=True if dates are in the form DD/MM/YYYY.

Selective columns
If you actually only need to work on columns 1, 9 and 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

The resulting dataframe would then only contain those 3 columns of data.

RAM drive
Using a RAM drive to store the temporary CSV file would further speed up the load time.

Note: This does assume you are using a Windows PC with Excel available.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • Thanks for this answer -- could you time it with the sample excel file and see how it performs? – David542 Sep 26 '17 at 19:26
  • 1
    Running it on my machine gives ~60s. In comparison, I get 7m30s using just `xlrd`. – Martin Evans Sep 26 '17 at 20:54
  • 1
    Use Python Pandas is a superb way to convert Excel to CSV. However, I was able to save more than 30 extra seconds with my slow computer (148s down to 113s, https://i.stack.imgur.com/EzbvR.jpg) with running VBScript in Python. But OP may not like it so just leave this in the comment. – ian0411 Sep 28 '17 at 17:31
5

I have created a sample Java program which is able to load the file in ~40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).

https://github.com/skadyan/largefile

This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.

An implementation of the callback interface com.stackoverlfow.largefile.RecordHandler can be used to process the data loaded from the Excel file. This interface defines only one method, which takes three arguments:

  • sheet name: String, the Excel sheet name
  • row number: int, the row number of the data
  • data map: Map of Excel cell reference to Excel formatted cell value

The class com.stackoverlfow.largefile.Main demonstrates one basic implementation of this interface, which just prints the row number to the console.

Update

The woodstox parser seems to have better performance than the standard SAXReader (code updated in the repo).

Also, in order to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. In that implementation, more optimized string/text value handling can be used and unnecessary text formatting operations may be skipped.

skadya
  • 4,330
  • 19
  • 27
  • actually this is the exact implementation we are currently using to parse the file. So looking for a solution that's faster than this. – David542 Oct 01 '17 at 19:19
  • To improve the performance what I can suggest you to try to use different Xml parser (e.g. https://github.com/FasterXML/woodstox). I observe slightly better results with this parser. see updated answer. – skadya Oct 02 '17 at 20:50
4

Using C# on a Dell Precision T1700 workstation, I was able to open the file and read its contents in about 24 seconds, just using standard code to open a workbook via interop services. With a reference to the Microsoft Excel 15.0 Object Library, here is my code.

My using statements:

using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

Code to open and read workbook:

public partial class MainWindow : Window {
    public MainWindow() {
        InitializeComponent();

        Excel.Application xlApp;
        Excel.Workbook wb;
        Excel.Worksheet ws;

        xlApp = new Excel.Application();
        xlApp.Visible = false;
        xlApp.ScreenUpdating = false;

        wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");

        ws = wb.Sheets["Sheet1"];

        //string rng = ws.get_Range("A1").Value;
        MessageBox.Show(ws.get_Range("A1").Value);

        Marshal.FinalReleaseComObject(ws);

        wb.Close();
        Marshal.FinalReleaseComObject(wb);

        xlApp.Quit();
        Marshal.FinalReleaseComObject(xlApp);

        GC.Collect();
        GC.WaitForPendingFinalizers();
    }
}
R. Roe
  • 609
  • 7
  • 18
3

It looks like this is hardly achievable in Python at all. If we unpack the sheet data file, it takes nearly all of the allotted 30 seconds just to pass it through the C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):

from __future__ import print_function

from lxml import etree
import time


start_ts = time.time()

for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',), 
                            collect_ids=False, resolve_entities=False,
                            huge_tree=True):
    pass

print(time.time() - start_ts)

The sample output: 27.2134890556

By the way, Excel itself needs about 40 seconds to load the workbook.
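The same single-pass idea works with just the standard library as well. A sketch (not the answer's exact code) using xml.etree.ElementTree instead of lxml, with element clearing added so memory stays bounded on a huge unpacked sheet; `iter_rows` is my own illustration:

```python
import io
import xml.etree.ElementTree as ET

def iter_rows(xml_source):
    """Yield each <row> element of a sheet XML stream, clearing elements
    after use so memory stays roughly constant on huge files."""
    for event, elem in ET.iterparse(xml_source, events=('end',)):
        # OOXML tags carry a namespace; compare on the local name only
        if elem.tag.rsplit('}', 1)[-1] == 'row':
            yield elem
            elem.clear()  # drop the subtree we just handed out
```

With the real file you would pass open('xl/worksheets/sheet1.xml', 'rb') after unzipping, as in the timing above.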

void
  • 2,759
  • 12
  • 28
3

The C# and OLE solution still has some bottlenecks, so I tested it with C++ and ADO.

_bstr_t connStr(makeConnStr(excelFile, header).c_str());

TESTHR(pRec.CreateInstance(__uuidof(Recordset)));       
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));

while(!pRec->adoEOF)
{
    for(long i = 0; i < pRec->Fields->GetCount(); ++i)
    {   
        _variant_t v = pRec->Fields->GetItem(i)->Value;
        if(v.vt == VT_R8)
            num[i] = v.dblVal;
        if(v.vt == VT_BSTR)
            str[i] = v.bstrVal;          
        ++cellCount;
    }                                    
    pRec->MoveNext();
}

On an i5-4460 machine with an HDD, I find that 500 thousand cells in an .xls file take 1.5 s, but the same data in .xlsx takes 2.829 s. So it should be possible to manipulate your data in under 30 s.

If you really need under 30 s, use a RAM drive to reduce file IO; it will significantly improve your process. I cannot download your data to test it, so please tell me the result.

obgnaw
  • 3,007
  • 11
  • 25
  • thanks for this. So what would be the whole code to 'read' the xls(x) file and get it in memory from the above xls file? – David542 Sep 30 '17 at 04:54
  • @David542 [here is the entire project](https://www.codeproject.com/Articles/126280/How-to-Read-and-Write-Excel-Files-in-C-via-ADO) – obgnaw Sep 30 '17 at 05:04
  • This is still using ADO internally so the results won't be much faster than the ADO.NET version which by the way is getting most of the votes but it is not the fastest / cleanest solution. – Isma Sep 30 '17 at 09:02
  • @Isma There are two bottleneck, middle layer and file IO.The quickest version need assumption.and i think OP finally will need RAM Drive to reduce time of file IO. – obgnaw Sep 30 '17 at 09:35
1

Another way that should largely improve the load/operation time is a RAMDrive:

create a RAMDrive with enough space for your file plus 10%..20% extra space...
copy the file to the RAMDrive...
load the file from there... depending on your drive and filesystem, the speed improvement should be huge...

My favourite is the IMDisk toolkit
(https://sourceforge.net/projects/imdisk-toolkit/); here you have a powerful command line to script everything...

I also recommend SoftPerfect RAM Disk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html),

but that also depends on your OS...
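The copy-then-load step described above can be scripted. A minimal Python sketch, where the RAM-drive mount point (R:\ from IMDisk on Windows, /dev/shm on Linux, etc.) is an assumption you must adjust, and `stage_on_ramdrive` is a name of my own:

```python
import shutil
from pathlib import Path

def stage_on_ramdrive(src, ram_root):
    """Copy `src` onto the mounted RAM drive (assumed at `ram_root`)
    and return the new path; open the workbook from that path so all
    subsequent reads hit RAM instead of disk."""
    dst = Path(ram_root) / Path(src).name
    shutil.copy2(src, dst)  # pay the file-IO cost once, sequentially
    return dst
```

xlrd, pandas, etc. are then pointed at the returned path instead of the original file.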

ZEE
  • 2,931
  • 5
  • 35
  • 47
0

I would like to have more info about the system where you are opening the file... anyway:

Look on your system for a Windows update called
"Office File Validation Add-In for Office ..."

If you have it... uninstall it...
The file should load much more quickly,
especially if it is loaded from a share.

ZEE
  • 2,931
  • 5
  • 35
  • 47
-1

Have you tried loading the worksheet on demand, which has been available since version 0.7.1 of xlrd?

To do this you need to pass on_demand=True to open_workbook().

xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper name='' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1, file_contents=None, encoding_override=None, formatting_info=False, on_demand=False, ragged_rows=False)


Other potential Python solutions I found for reading an xlsx file:

  • Read the raw xml in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml'
  • Try the openpyxl library's Read Only mode, which claims to be optimized for memory usage on large files.

    from openpyxl import load_workbook
    wb = load_workbook(filename='large_file.xlsx', read_only=True)
    ws = wb['big_data']
    
    for row in ws.rows:
        for cell in row:
            print(cell.value)
    
  • If you are running on Windows you could use PyWin32 and 'Excel.Application'

    import time
    import win32com.client as win32
    def excel():
       xl = win32.gencache.EnsureDispatch('Excel.Application')
       ss = xl.Workbooks.Add()
    ...
    
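The first bullet above (reading the raw XML yourself) can be sketched with nothing but the standard library. `read_shared_strings` is my own illustration, not an established API; it pulls the shared-string table straight out of the .xlsx zip, skipping all workbook machinery:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# all SpreadsheetML tags live in this namespace
NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

def read_shared_strings(xlsx_path):
    """Return the shared-string table as a list, indexed the way
    worksheet cells with t="s" reference it."""
    with zipfile.ZipFile(xlsx_path) as z:
        with z.open('xl/sharedStrings.xml') as f:
            strings = []
            for event, elem in ET.iterparse(f, events=('end',)):
                if elem.tag == NS + 'si':
                    # join all <t> runs inside this string item
                    strings.append(''.join(t.text or ''
                                           for t in elem.iter(NS + 't')))
                    elem.clear()  # keep memory bounded
            return strings
```

The same pattern applied to 'xl/worksheets/sheet1.xml' gives you the cell values; whether pure Python can do that within the 30-second budget is exactly what the lxml answer above measures.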
mikek3332002
  • 3,546
  • 4
  • 37
  • 47
  • 1
    This seems to only load the necessary sheets, but in the above case there is only one sheet to load, so the `on_demand` saves no time (I tested it). – David542 Sep 26 '17 at 02:54