
I have a database into which I regularly need to import large amounts of data via some Python scripts. Compacted, the data for a single month's imports takes about 280 MB, but during the import the file size swells to over a GB.

Given the 2 GB size limit on MDB files, this is a bit of a concern. Apart from breaking the inserts into chunks and compacting in between each, are there any techniques for avoiding the increase in file size?

Note that no temporary tables are being created/deleted during the process: just inserts into existing tables.

And to forestall the inevitable comments: yes, I am required to store this data in Access 2003. No, I can't upgrade to Access 2007.

If it could help, I could preprocess in sqlite.

Edit:

Just to add some further information (some already listed in my comments):

  • The data is being generated in Python on a table-by-table basis, and then all of the records for that table are batch-inserted via ODBC
  • All processing is happening in Python: all the MDB file is doing is storing the data
  • All of the records being inserted are valid (none are being excluded due to unique key violations, etc.)

Given the above, I'll be looking into how to disable row-level locking via ODBC and considering presorting the data and/or removing and then reinstating indexes. Thanks for the suggestions.
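For anyone following along, the index-dropping variant I'm planning to test looks roughly like this. The table, index, and column names below are made up, and this is an untested sketch rather than my actual script:

    import pyodbc

    # Placeholder data; the real rows come from the generation step above.
    rows = [("2009-10-01", 1.0), ("2009-10-02", 2.5)]

    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb"
    )
    cur = conn.cursor()

    # Jet DDL: drop the index before the bulk insert...
    cur.execute("DROP INDEX idx_readings_stamp ON readings")

    # ...batch-insert the records, presorted on the indexed column...
    cur.executemany(
        "INSERT INTO readings (stamp, value) VALUES (?, ?)",
        sorted(rows),
    )

    # ...then reinstate the index and commit.
    cur.execute("CREATE INDEX idx_readings_stamp ON readings (stamp)")
    conn.commit()
    conn.close()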

Any further suggestions still welcome.

asked by mavnn; edited by Fionnuala
    Q1: Have you optimised field settings e.g. string length? Q2: Dropping indexes pre-load and recreating them afterwards will speed things up and may save space. – heferav Oct 30 '09 at 16:28
  • I'll be interested in the response. There are many limits in Access that can be endured or worked around, but this one has seemed to me to be critical. – Smandoli Oct 30 '09 at 16:30
  • @heferav Speed isn't an issue (I only need to do this once a month on a single-user database), but I will see if removing indexes helps on size. @Remou Why remove the ms-access-2003 tag? I don't want to get into a tagging war, but isn't the version of Access relevant? – mavnn Oct 30 '09 at 16:35
  • What do you mean by "preprocess"? We don't know enough about the process to guess what might be taking up space but if there's some preprocessing going on that involves the database (instead of text file or in-memory manipulation) that's probably where the problem is. – Larry Lustig Oct 30 '09 at 16:37
  • @Larry I'm inserting values via ODBC (using pyodbc). At the moment I'm creating the data via a python script and then inserting into Access direct: if, for example, there is a way of bulk transferring from sqlite it might be more efficient. No processing is being done in Access. – mavnn Oct 30 '09 at 16:41
  • @mavnn There are 12 posts tagged 2003, so it does not get much attention. There are not so many differences between the bunch of Access versions between '97 and 2007. I had no intention of offending :) – Fionnuala Oct 30 '09 at 16:47
  • To me the important thing is the ms-access tag as that'll get a lot more views. – Tony Toews Oct 30 '09 at 17:58
  • Not offended, just curious :) – mavnn Oct 30 '09 at 18:08
  • If you eliminate useful information by removing a tag, you should insert what you've removed into the text of the post. I'd do it myself, but I'm torn by the fact that this is not really an Access question at all, but a Jet/ACE question. – David-W-Fenton Oct 30 '09 at 23:26

6 Answers

4

Are you sure row locking is turned off? In my case, turning off row locking reduced bloat by over 100 MB when working on a 5 MB file (in other words, the file barely grew after turning off row locking, to about 6 MB). With row locking on, the same operation resulted in a file well over 100 MB in size.

Row locking is a HUGE source of bloat during recordset operations since it pads each record to a page size.

Do you have MS Access installed here, or are you just using Jet? (Jet is the data engine that MS Access uses; you can use Jet without Access.)

Open the database in MS Access and go to:

Tools -> Options. On the Advanced tab, un-check the box: [ ] Open databases using record-level locking.

This will not only make a HUGE difference in the file growth (bloat), it will also speed things up by a factor of ten.

There is also a registry setting that you can use here.

And are you using ODBC, or an OLE DB connection?

You can try:

    Set rs = New ADODB.Recordset
    With rs
        .ActiveConnection = RsCnn
        .Properties("Jet OLEDB:Locking Granularity") = 1
    End With

Try the setting from Access (change the setting), exit, re-enter, and then compact and repair. Then run your test import…the bloat issue should go away.

There is likely no need to open the database using row locking. If you turn off that feature, then you should be able to reduce the bloat in file size down to a minimum.
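If the import has to stay in Python, the same idea can be tried when opening the connection through ADO via pywin32, instead of through the Access UI. A rough, untested sketch; the "Jet OLEDB:Database Locking Mode" property is the documented Jet OLE DB one (0 = page-level, 1 = row-level), and the file path is a placeholder:

    import win32com.client

    conn = win32com.client.Dispatch("ADODB.Connection")
    conn.Provider = "Microsoft.Jet.OLEDB.4.0"
    # 0 = page-level locking, 1 = row-level locking (the source of the bloat).
    # This must be set before Open; the first opener determines the mode.
    conn.Properties("Jet OLEDB:Database Locking Mode").Value = 0
    conn.Open(r"Data Source=C:\data\imports.mdb")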

For further reading and an example, see here: Does ACEDAO support row level locking?

Albert D. Kallal
  • Ahh, now that could make a lot of difference too. And that was new in Jet 4.0, so that could correlate with my experience. Now, I thought that the first user accessing the BE MDB specified the record-level locking and all subsequent users then used that user's options. Or something like that. – Tony Toews Oct 30 '09 at 20:24
  • Thank you for this: it made a huge impact to both speed and file size during the imports. Compacting afterwards made almost no difference at all. – mavnn Nov 02 '09 at 09:53
3

A common trick, if feasible with regard to the schema and semantics of the application, is to have several MDB files with Linked tables.

Also, the way the insertions take place matters with regard to the way the file size balloons. For example: batched vs. one/few records at a time, sorted (relative to particular index(es)) vs. unsorted, and the number of indexes (as you mentioned, readily dropping some during the insert phase)...

A tentative pre-processing approach: store the new rows in a separate linked table, heap fashion (no indexes), then sort/index this data in a minimal fashion, and "bulk load" it into its real destination. Similar pre-processing in SQLite (as hinted in the question) would serve the same purpose. Keeping it "ALL MDB" is maybe easier (fewer languages/processes to learn, fewer inter-op issues [hopefully ;-)]...).
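To make the heap-then-bulk-load idea concrete, here is a hedged sketch using the SQLite staging step the question already mentions. All table names, column names, and paths are hypothetical:

    import sqlite3
    import pyodbc

    generated_rows = [("2009-10-01", 1.0)]  # stand-in for the real Python feed

    # Stage the rows in an unindexed (heap-like) SQLite table.
    stage = sqlite3.connect("stage.db")
    stage.execute("CREATE TABLE IF NOT EXISTS readings (stamp TEXT, value REAL)")
    stage.executemany("INSERT INTO readings VALUES (?, ?)", generated_rows)
    stage.commit()

    # Read the rows back sorted to match the destination index order...
    ordered = stage.execute("SELECT stamp, value FROM readings ORDER BY stamp")

    # ...and bulk-load them into the MDB in a single pass.
    mdb = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb"
    )
    mdb.cursor().executemany(
        "INSERT INTO readings (stamp, value) VALUES (?, ?)", ordered
    )
    mdb.commit()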

EDIT: on why inserting records in a sorted/bulk fashion may slow down the MDB file's growth (question from Tony Toews)
One of the reasons for MDB files' propensity to grow more quickly than the rate at which text/data is added to them (and their counterpart ability to be easily compacted back down) is that as information is added, some of the nodes that constitute the indexes have to be re-arranged (for overflowing / rebalancing etc.). Such management of the nodes seems to be implemented in a fashion which favors speed over disk space economy, and this approach typically serves simple applications / small data rather well. I do not know the specific logic in use for such management, but I suspect that in several cases, node operations cause a particular node (or much of it) to be copied anew, with the old location simply being marked as free/unused but not deleted/compacted/reused. I do have "clinical" (if only a bit outdated) evidence that by performing inserts in bulk we essentially limit the number of opportunities for such duplication to occur, and hence we slow the growth.

EDIT again: After reading and discussing things with Tony Toews and Albert Kallal, it appears that a possibly more significant source of bloat, in particular in Jet Engine 4.0, is the way locking is implemented. It is therefore important to open the database in single-user mode to avoid this. (Read Tony's and Albert's responses for more details.)

mjv
  • It is possible, but awkward as copies of the database are made regularly and moved to different file locations. I will go this route if required, but avoiding it is one of the major reasons for the question. – mavnn Oct 30 '09 at 16:38
  • Could you be more specific as to how batched vs. one/few records makes a difference to Access databases? As well as presorting. – Tony Toews Oct 30 '09 at 17:56
  • @Tony Toews. See edit. The explanation is an educated guess, certainly backed by hands-on experience. I'm sure we could find authoritative info online about the inner workings of the Jet Engine which would shed further light on the topic. Maybe part of the reason why there's relatively little documentation about this, and also why the engine is not improved in this area, is that this only affects a small number of applications. If you're "small data", no problem; if you're "big data", you use "real" DBMSes, leaving only medium-sized apps like this one to suffer ;-) – mjv Oct 30 '09 at 19:45
  • I see where you are coming from; however, I disagree with it, and I like Albert's explanation a lot better. – Tony Toews Oct 30 '09 at 21:24
  • @Tony Toews Agreed on Albert's excellent explanation about exclusive access/locking. Bloat happens in exclusive access mode too, at least it did in earlier Jet Engine versions. (What goes without saying also goes when we say it, and we/I should have stressed to the OP the interest of accessing the database in exclusive / no-locks mode.) On your disagreement, is it about the technical merit of the explanation, or about the opinion that MDB is not a "real" dbms and receives relatively little engineering effort (compared to, say, MSSQL)? – mjv Oct 30 '09 at 23:12
  • My disagreement was with the technical merits of your argument and not with your comment with respect to "real" DBMSes. Your attitude there was reasonably sane and expressed in a way similar to what I would've said, unlike others on this online forum. That said, as I stated in my reply, I saw a lot more bloat in Jet 4.0/Access 2000 systems than I did in Jet 3.51/Access 97. It was the same app which I was working with for about five years on and off. – Tony Toews Oct 31 '09 at 03:54
  • Tony, gotcha. I amended my prose accordingly. You and Albert seem to have more recent exposure to Jet/MDB; I still think that bulk processing could also provide a worthy improvement (but then again I saw the OP's added description of how individual rows are fed with ODBC!... so who cares...). In any case, if Albert or you feel that I'm off-base, I'll be pleased to delete my post for clarity's sake. Cheers. – mjv Oct 31 '09 at 04:37
3

One thing to watch out for is records which are present in the append queries but aren't inserted into the data due to duplicate key values, null required fields, etc. Access will allocate the space taken by the records which aren't inserted.

About the only significant thing I'm aware of is to ensure you have exclusive access to the database file, which might be impossible if doing this during the day. I noticed a change in behavior from Jet 3.51 (used in Access 97) to Jet 4.0 (used in Access 2000): the Access MDBs started getting a lot larger when doing record appends. I think that if the MDB is being used by multiple folks, then records are inserted one per 4k page rather than as many as can be stuffed into a page, likely because this made index insert/update operations faster.
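From Python, exclusive access can be requested in the ODBC connection string; the Access desktop ODBC driver accepts an Exclusive attribute. An untested sketch (the path is a placeholder, and the connect will fail if anyone else has the file open):

    import pyodbc

    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};"
        r"DBQ=C:\data\imports.mdb;"
        r"Exclusive=1"
    )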

Now, compacting does indeed put as many records in the same 4k page as possible, but that isn't of help to you.
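That said, if you do end up compacting between chunks, the step can at least be scripted rather than done by hand in Access. A sketch via the JRO COM component and pywin32 (paths are placeholders; the destination file must not already exist):

    import os
    import win32com.client

    SRC = r"C:\data\imports.mdb"
    DST = r"C:\data\imports_compact.mdb"

    engine = win32com.client.Dispatch("JRO.JetEngine")
    engine.CompactDatabase(
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + SRC,
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + DST,
    )
    os.replace(DST, SRC)  # swap the compacted copy into place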

Tony Toews
1

Is your script executing a single INSERT statement per row of data? If so, pre-processing the data into a text file of many rows that could then be inserted with a single INSERT statement might improve the efficiency and cut down on the accumulating temporary crud that's causing it to bloat.
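One way to get that single-statement load is Jet's ability to read a delimited text file directly in a query via its Text IISAM. A hedged sketch (names and paths are placeholders, and the bracketed syntax, where the dot in the file name is written as #, is worth double-checking against your Jet version):

    import csv
    import pyodbc

    # Write the batch out as a CSV with a header row (placeholder data).
    with open(r"C:\data\batch.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["stamp", "value"])
        writer.writerows([("2009-10-01", 1.0)])

    # Let Jet ingest the whole file with a single INSERT ... SELECT.
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb"
    )
    conn.execute(
        "INSERT INTO readings (stamp, value) "
        "SELECT stamp, value FROM [Text;HDR=Yes;Database=C:\\data].[batch#csv]"
    )
    conn.commit()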

You might also make sure the INSERT is being executed without transactions. Whether or not that happens implicitly depends on the Jet version and the data interface library you're using to accomplish the task. By explicitly making sure it's off, you could improve the situation.
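With pyodbc in particular, statements run inside an implicit transaction unless you opt out, so it is worth measuring both modes rather than assuming:

    import pyodbc

    # autocommit=True makes each statement commit on its own, with no wrapping
    # transaction; the default (False) holds everything open until commit().
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb",
        autocommit=True,
    )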

Another possibility is to drop the indexes before the insert, compact, run the insert, compact, re-instate the indexes, and run a final compact.

David-W-Fenton
0

I find I am able to link from Access to SQLite and to run a make-table query to import the data. I used this ODBC driver: http://www.ch-werner.de/sqliteodbc/ and created a user DSN.
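The same pull can also be run from Python, since Jet accepts an inline [ODBC;DSN=...] data source in a query. A hedged sketch assuming a user DSN named sqlite_stage pointing at the SQLite file (all names are placeholders):

    import pyodbc

    mdb = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=C:\data\imports.mdb"
    )
    # Make-table query: creates readings_new in the MDB from the SQLite table.
    mdb.execute(
        "SELECT * INTO readings_new FROM [ODBC;DSN=sqlite_stage].[readings]"
    )
    mdb.commit()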

Fionnuala
  • Unfortunately I don't have admin rights to install the driver. +1 for the help though. – mavnn Oct 30 '09 at 18:12
  • Why the down votes? This post relates to a comment by the OP: "if, for example, there is a way of bulk transferring from sqlite it might be more efficient. No processing is being done in Access." – Fionnuala Oct 30 '09 at 20:48
-1

File -> Options -> Current Database -> check the options below:
  • Use the Cache format that is compatible with Microsoft Access 2010 and later
  • Clear Cache on Close

Then your file will be compacted back to its original size when it is saved.

Pavan
  • The above option has nothing to do with bloat in general. The cache format you speak of ONLY applies when using Access with SharePoint back-end tables. If you check that option, then Access will use the NEW high-speed format and allow "off line" caching of data. In fact, checking this option will FORCE Access to make a LOCAL COPY of the data from SharePoint – it will thus do 100% the opposite of what this post desires, and the result will be a significantly increased file size. Since this post is not about SharePoint tables, your post is off topic and not relevant. – Albert D. Kallal Jun 25 '15 at 20:09