5

I'm working on a project where the data set is stored across thousands of Excel workbooks (currently ~14000) in a folder on a Sharepoint server, each of which gets regularly modified to reflect changes to the data subset in that workbook. I know it's not a good way to store and update this data, but that's the situation as is.

I have to perform various queries that involve extracting the contents of one or more fields from all workbooks in the folder. I've been doing this by iterating through the set of workbooks as follows:

Function getData() As Workbook

Dim resultBk As Workbook
Dim fldr As Folder
Dim fso As New FileSystemObject
Dim fileObj As File
Dim filePath As String
Dim queryBk As Workbook

'create a workbook for storing the query results
Set resultBk = Workbooks.Add(resultBkTemplatePath)

'get the folder with all the workbooks to be queried
Set fldr = fso.GetFolder(sharepointFolderPath)

For Each fileObj In fldr.Files
    'try opening each of the workbooks
    Set queryBk = Workbooks.Open(fileObj.Path, ReadOnly:=True)

    'get data from queryBk and add it to resultBk, or add a row with an error message if queryBk failed to open
    addBkDataToResults resultBk, queryBk

    queryBk.Close False
    Set queryBk = Nothing
Next

getData = resultBk

End Function

This process of opening and closing so many workbooks across a Sharepoint connection is very slow; it generally takes 12-14 hours to run a complete query. Is there a faster way to read data out of an Excel book that doesn't need to open/close it? Or is there a way that I can handle the open/close process to make this faster?

Community
  • 1
  • 1
sigil
  • 9,370
  • 40
  • 119
  • 199
  • 2
    Haha. Hah. Repeat after me: "SharePoint is *not* a database". On a more serious note, this can likely be done faster using an approach that does not rely on COM+/OLE, such as a 3rd party library that can read said files independently from Office. –  Feb 12 '13 at 20:29
  • @pst: and for sure that's all sigil's fault, right? not very constructive today.. – Peter Albert Feb 12 '13 at 20:30
  • @PeterAlbert I never said it was. And I offered a path to follow. –  Feb 12 '13 at 20:31
  • 2
    See http://stackoverflow.com/questions/15828/reading-excel-files-from-c-sharp - it comes from a C# side, but covers various options available in .NET (using VBA/WSH might be more .. limiting.) including "standard" and 3rd party. Also, consider *only* opening workbooks that have been updated; the Microsoft Sync Framework (or an approach like it) could be used with a "cache store" - if only some documents are updated, this could very greatly speed up the process by only needing to read a little bit of [new] data. –  Feb 12 '13 at 20:33
  • 1
    @pst, fair enough... ;-) – Peter Albert Feb 12 '13 at 20:34
  • You can try to have a local copy of the query xlsx on your local box instead of opening from sharepoint. You could shedule the process of getting the local copies ahead of your actual running of this query vba stuff. – rene Feb 12 '13 at 20:39
  • 1
    Are you using the http path, or the direct "\\someserver\foldername" UNC? The latter might be faster. You can likely also get better throughput by splitting the work over two or more PC's... How large are the files? – Tim Williams Feb 12 '13 at 21:14
  • @TimWilliams, I'm using the UNC path: `\\wss\sites\mysite\thisBk.xlsx`. Files are about 1 MB-5 MB. That's a good idea about splitting the work on several PCs, I'll probably do that. – sigil Feb 12 '13 at 22:04
  • 1
    1-5MB is a good size, so ~3sec per workbook doesn't seem slow: you just have a *lot* of files... @pst 's suggestion of keeping track of last-updated times seems like a good modification to make. – Tim Williams Feb 12 '13 at 22:21
  • Elementary suggestion but does `application.screenupdating = false` speed up the process a little bit? Another relatively easy change might be to Dim another object variable like `Dim newXLApp as Object` and create a new instance of Excel using `Set newXLApp = CreateObject("Excel.Application")` and use this new instance of Excel to iterate over the ~14,000 files. You can keep it hidden with `newXLApp.Visible = False` which should be faster than if the application is visible. – David Zemens Feb 20 '13 at 00:38
  • possible duplicate of [Query my excel worksheet with VBA](http://stackoverflow.com/questions/18637376/query-my-excel-worksheet-with-vba) –  Sep 17 '14 at 14:54

1 Answers1

1

I would use ADO and connect to the Excel workbooks as databases, allowing you to issue SQL statements against them. I would definitely do this for reading the data, but also for the writing (see here).

Saving the data in Access would certainly be an improvement over saving it in Excel, and you could consider scheduling a compact every so often to keep the database size down.

Community
  • 1
  • 1
Zev Spitz
  • 13,950
  • 6
  • 64
  • 136
  • 1
    ADO + SQL is a lot faster than opening/closing workbooks. I like this approach better, although there are times when it won't work (e.g., getting the values of custom properties in worksheets). – sigil Feb 20 '13 at 22:57
  • If the documents are .xlsx, another option to consider would be the Office Open XML SDK, which would probably be even faster. OTOH, I am unaware if you can use the SDK as a COM component, which would be necessary if doing this from VBScript or VBA. – Zev Spitz Feb 21 '13 at 07:22
  • @JasonR.Mick And now we can both delete our previous comments on the issue. – Zev Spitz May 12 '16 at 07:05