4

I am using Delphi 2009. I have a very simple data structure, with 2 fields:

  1. A string that is the key field I need to retrieve by and is usually 4 to 15 characters in length.
  2. A string that is the data field that may be any size, from 1 character up to say 10,000 characters.

The difficulty is that I may have several million of these records, so they may total as much or more than 10 GB in size. Obviously, I'm looking for an on-disk solution rather than an in-memory solution.

My program needs to randomly retrieve these records, based on the key field. That's the part that needs to be made as efficient as possible.

Should I use a database for such a simple structure, and if so, which database would be best to handle this and be simplest to implement?

Alternatively, is there a simple on-disk data structure not requiring a full-blown database that would work just as well?


Well, all I needed was the one answer to kick me back into reality. I was looking for something simpler than even a simple database. But when the no-duh answer is to use a database, I realize I've already answered this question with my own answer to another question: Best database for small applications and tools.

My answer was DISQLite3 for the reasons I specified there. And that's what I'll probably go with for my implementation.


A few more good answers with some possibilities. That's great. I'll be able to try a few different methods to see what works best.


More contemplation, and I have had to change the accepted answer to the GpStructuredStorage solution.

In my case, a million records totalling several Gigabytes will put a strain on a database structure. Specifically, the B* tree that is used to store the index in most databases is fast, but will slow down for some operations such as reindexing a million values.

The only thing you'll find faster than B* for an index is a hash table. And that is precisely what is provided in gabr's suggested addition to the GpStructuredStorage solution. I think it's quite elegant the way he segmented the hash value to give a 4 level directory structure.

The key reason why I can go to a hash solution is that I only need random access by the keys. I do not need to sort by the keys. If sorting was needed, then the gains of the speed of the hash table would be lost and the database system would be a no-brain winner.

When I get down to implementing this, I should do a comparison of this technique versus a database. Maybe I'll compare with both Firebird and SQLite which would both be worthy opponents.


One more followup:

I just discovered Synopse Big Table by A. Bouchez which is designed for speed and meets the specs of my question almost precisely. I'll be trying it out first when I do my implementation in a few months and will report back here with my results.


A much later followup (July 2015)

I never did get to try Synopse Big Table. I've stuck with my B* tree up to now. But now I've upgraded to Delphi XE8 and plan to go with the database solution using FireDAC with SQLite.
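Roughly the shape of what I have in mind (just a sketch under my current assumptions: the database file, table, and column names below are placeholders of my own, not anything mandated by FireDAC):

uses
  System.SysUtils,
  FireDAC.Stan.Def, FireDAC.Stan.Async, FireDAC.DApt,
  FireDAC.Phys.SQLite,      // links in the SQLite driver
  FireDAC.Comp.Client;      // TFDConnection

procedure KeyValueSketch;
var
  Conn: TFDConnection;
  Data: string;
begin
  Conn := TFDConnection.Create(nil);
  try
    Conn.Params.Add('DriverID=SQLite');
    Conn.Params.Add('Database=C:\data\records.db');   // placeholder path
    Conn.Connected := True;

    // One two-column table; the primary key is what makes keyed lookups fast.
    Conn.ExecSQL('CREATE TABLE IF NOT EXISTS recs ' +
                 '(keystr TEXT PRIMARY KEY, datastr TEXT)');

    // Add or update a record (parameterized, so a 10,000 character data
    // string is passed through untouched).
    Conn.ExecSQL('INSERT OR REPLACE INTO recs (keystr, datastr) VALUES (:k, :d)',
                 ['SMITH1850', 'the data string']);

    // Random retrieval by key.
    Data := Conn.ExecSQLScalar('SELECT datastr FROM recs WHERE keystr = :k',
                               ['SMITH1850']);
  finally
    Conn.Free;
  end;
end;

The PRIMARY KEY on the key column gives SQLite an index on it, so random retrieval by key stays a single indexed lookup.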

lkessler
  • Do you only need read only access to the data? Or will you randomly be adding records also? – skamradt Nov 25 '09 at 21:11
  • I will need to modify, add, and delete records as well. Modify may mean changing the data length of the record. – lkessler Nov 26 '09 at 07:04

7 Answers

5

For more than 10GB in data, a database is exactly what you need. It will handle indexing for rapidly locating data (your random retrieval), the functionality for adding, modifying, and deleting data, and the actual storage, as well as much more if you so choose.

There are dozens of posts here related to which databases are available for use in Delphi, including built-in ones and free open-source ones like Firebird.

Ken White
  • So which database would be best geared towards this specific data structure, or are you saying it doesn't matter? – lkessler Nov 25 '09 at 20:30
  • It doesn't matter. There are several viable alternatives; Firebird is one I've used myself and it works really well, and it's free to boot. :-) There are others that may work better, depending on your specific needs (contained fully in the executable, or no DBA required, etc.); I can't recommend anything more specific without more info on what you're actually doing. – Ken White Nov 25 '09 at 20:51
  • It does seem a shame to carry around an entire database for a single two-column table. So I'd choose the smallest, simplest, least-capable database that meets your performance requirements. In particular, you don't seem to need SQL or relationality, so there would be no value to you in having those features supported. BDB (below) would be a good option if Delphi bindings are available. – Larry Lustig Nov 25 '09 at 21:35
  • @Larry: I'd have to disagree. :-) 10GB of data needing quick random access is more than a simple two-column database. Of course, your definition of "quick" might be different than mine. Considering that the *standard* (note the italics) RAM in a desktop now is 3 or 4 GB, 10GB is pretty big. The key is the "quick" part, which cries out for an optimized database with good index capabilities. Zeos is a good library for accessing several databases, as long as you're not using D2009 or higher; they don't have a non-beta Unicode version available yet. – Ken White Nov 25 '09 at 21:44
  • @Larry: Support for unneeded features doesn't take anything away. However, a mature database system has probably much more work invested already into file access and caching strategies than one could reasonably achieve on their own, when writing a "simple" on-disk data structure. Why not make use of it? – mghie Nov 25 '09 at 21:45
4

Why all the bragging? Just use GpStructuredStorage (it has a 4 TB restriction, and with a little time invested in the class you can go beyond that). It will take you a few hours to get used to it, but it's worth the time. Hope it helps.

GpStructuredStorage can retrieve file names extremely fast (I've tested it). You save each record as a file in GpStructuredStorage and retrieve each name as a string into a string list; a million string names (since you mentioned a string list) need only a few MB of RAM, which is not much. Use a TStream descendant to write the data to a file in GpStructuredStorage. I don't have time today to write a simple example, but on Saturday or Sunday I will write a tutorial for GPStructuredStorage on my blog.


[Added by gabr - I hope that will not be considered a terrible breach of netiquette. It's just that my thoughts don't fit in a comment (sizewise) and that it feels stupid to add another answer just to write this ...]

While GpStructuredStorage can store loads of data, finding it may be a slow process. What I usually do in such cases is to create a hash of the key and convert it into hex (say 00FFA784). Then I convert this hex hash into a folder structure (in this case it would be /00/FF/A7/84) and store the relevant data in this folder, either as a file, as attributes, or a combination of both.

This approach speeds up data retrieval but slows down data insertions and is therefore recommended only for mostly static data. If the data is fairly dynamic I'd certainly recommend using a database and not GpStructuredStorage.
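A rough sketch of the key-to-folder mapping described above (the hash function itself is not prescribed; FNV-1a is used here purely as an illustration, and the helper name is made up):

uses
  SysUtils;   // IntToHex

{$OVERFLOWCHECKS OFF}   // the FNV-1a multiplication below is meant to wrap

// Turn a key into the 4-level folder path described above,
// e.g. a hash of $00FFA784 becomes '/00/FF/A7/84'.
function KeyToFolder(const aKey: string): string;
var
  Hash: Cardinal;
  i: Integer;
  Hex: string;
begin
  Hash := 2166136261;                        // FNV-1a offset basis
  for i := 1 to Length(aKey) do
    Hash := (Hash xor Ord(aKey[i])) * 16777619;
  Hex := IntToHex(Hash, 8);
  Result := '/' + Copy(Hex, 1, 2) + '/' + Copy(Hex, 3, 2) + '/' +
            Copy(Hex, 5, 2) + '/' + Copy(Hex, 7, 2);
end;

The record's data then goes into that folder, either as a file, as attributes, or both, as described above.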

gabr
  • I heard about GPStructuredStorage, but wouldn't have thought it might be applicable to this problem. Maybe it is. The only question would be once it's got a million items in it, will it become as slow as a Windows file system, or did gabr optimize the retrieval with a fast tree or his great cache? I might just try this when I compare. Price is right. – lkessler Nov 26 '09 at 00:26
  • Actually, I have no idea how GpStructuredStorage would behave on such big set but you're certainly welcome to try :) – gabr Nov 26 '09 at 09:04
  • As it turns out, the database will be created all at once the first time (imported from another data set). After that, it will be updated on a record by record basis (only a few records at a time), so maybe in my case, with gabr's suggestion on using your hash function as the folder path, GpStructuredStorage might be a possible solution. – lkessler Nov 26 '09 at 15:47
  • I especially like gabr's idea of grouping the hash into 2 character combinations to make a 4 level folder structure. With a million records, each level will only average 33 entries which should make finding the files fairly quick, even if just a sequential directory/file search is used. – lkessler Nov 26 '09 at 15:50
  • gabr: In http://stackoverflow.com/questions/222699/which-embedded-database-to-use-in-a-delphi-application/222725#222725 you answered Firebird, and I commented agreeing with you. But I haven't tried Firebird yet. How would you compare it with the GPStructuredStorage solution for my particular problem? – lkessler Nov 26 '09 at 15:57
  • I don't think that quickly finding the data would be a problem for any database, but it's very important to cause as little I/O as possible when searching for and retrieving data. I have no experience with `GpStructuredStorage`, so I can't comment on it, but database systems make sure that indexing structures are as small as possible and in adjacent disc blocks so as to minimize hard disc seek times. Does `GpStructuredStorage` maintain folder structure information in one continuous area? Is there any technique to minimize or even prevent fragmentation caused by file modifications? – mghie Nov 26 '09 at 16:40
  • delphigeist: I'm looking forward to your tutorial on GPStructuredStorage. Have you compared its speed to a database? – lkessler Nov 27 '09 at 14:26
2

You should analyse your data. If

  1. a sizeable part of the data values is larger than the default file system block size,
  2. you don't want to search in the data values using SQL (so it doesn't matter what format they are stored in), and
  3. you really need random access over the whole database,

then you should test whether compressing your data values increases performance. The decompression of data values (especially on a modern machine with multiple cores, performed in background threads) should incur only a small performance hit, but the gains from having to read fewer blocks from the hard disc (especially if they are not in the cache) could be much larger.

But you need to measure; maybe the database engine stores compressed data anyway.
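A rough sketch of what that could look like, assuming the ZLib unit that ships with Delphi 2009 (TZCompressionStream/TZDecompressionStream); the UTF-8 round trip is only one possible way to turn the Unicode data strings into compressible bytes:

uses
  SysUtils, Classes, ZLib;

// Deflate a data value before it is written to disk.
function CompressValue(const aValue: string): TBytes;
var
  Utf8: TBytes;
  Dest: TBytesStream;
  Z: TZCompressionStream;
begin
  Utf8 := TEncoding.UTF8.GetBytes(aValue);
  Dest := TBytesStream.Create;
  try
    Z := TZCompressionStream.Create(Dest, zcDefault);
    try
      if Length(Utf8) > 0 then
        Z.WriteBuffer(Utf8[0], Length(Utf8));
    finally
      Z.Free;   // flushes the remaining compressed data into Dest
    end;
    SetLength(Result, Dest.Size);
    if Dest.Size > 0 then
      Move(Dest.Memory^, Result[0], Dest.Size);
  finally
    Dest.Free;
  end;
end;

// Inflate a stored value after it has been read back.
function DecompressValue(const aData: TBytes): string;
var
  Src: TBytesStream;
  Z: TZDecompressionStream;
  OutBytes: TBytes;
  Buf: array[0..8191] of Byte;
  N, Total: Integer;
begin
  Result := '';
  Src := TBytesStream.Create(aData);
  Z := TZDecompressionStream.Create(Src);
  try
    Total := 0;
    repeat
      N := Z.Read(Buf, SizeOf(Buf));
      if N > 0 then
      begin
        SetLength(OutBytes, Total + N);
        Move(Buf, OutBytes[Total], N);
        Inc(Total, N);
      end;
    until N = 0;
    if Total > 0 then
      Result := TEncoding.UTF8.GetString(OutBytes);
  finally
    Z.Free;
    Src.Free;
  end;
end;

Whether this is a net win depends entirely on the data, so time it against the uncompressed version, as suggested above.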

mghie
  • That's a good thought. Compressing may speed up. I'll implement first, see if it's fast enough. If not, I'll try your suggestion. – lkessler Nov 26 '09 at 00:18
1

Berkeley DB is exactly that.

glebm
  • Does it have Delphi bindings? A quick google doesn't show much. – Larry Lustig Nov 25 '09 at 20:19
  • @Glex: One developer, no download available, and the SVN contains 2 revisions, the last one nearly 3 years old. Doesn't look that promising. – mghie Nov 26 '09 at 04:59
  • These are just bindings dude. If BerkleyDB changed that much in the last 3 years, it won't be that hard to fix the API changes anyway. – glebm Nov 26 '09 at 14:37
  • Maybe BerkleyDB didn't change that much, but I doubt the files will work without any changes in Delphi 2009 or 2010. Did you try? – mghie Nov 26 '09 at 16:45
  • There is an article in the issues section on what to change so that it works in Delphi 7. Delphi >7 is essentially a superset of Delphi 7. However, good point, I have not tried it. – glebm Nov 26 '09 at 18:21
1

If you often need large datasets like this, and have some money to spare, simply put 16GB (500-750 EUR) in a machine, and make the lookup a separate process built with some 64-bit compiler (*) that you query over e.g. shared memory or some other IPC method.

In that case you can use the in-memory approach until Delphi 64-bit finally comes out. Since your data seems to be simple (a map from array of char to array of char) it is easy to export over IPC.

This is of course only if this approach has any merit for your case (e.g. if it acts as a cache), which I can't determine from your question.

(*) I recommend FPC of course :-)

I did this once, up to about 5 million objects and 5 GB of data.

I got permission to open source the container types I made for it, they are at:

http://www.stack.nl/~marcov/lightcontainers.zip (warning: very dirty code)

mghie: to answer with another cliché: there is no silver bullet.

Databases have a lot of other assumptions too:

  • Their generalized approach makes relatively inefficient use of memory. Most notably, your dataset stored with normal in-memory techniques falls inside the affordable memory range, which is of course typically bigger for a server (my bad assumption here, apparently) than for a client.
  • Databases assume that their result sets can be reduced to small sets within the database server with relatively straightforward processing, assisted by indexing.
  • They have a relatively high latency.
  • They are relatively bad at some kinds of processing (like multidimensional analysis/OLAP, which is why databases need to be extended for that).

This makes databases relatively bad for use in e.g. caches, load balancers etc. Of course that is all assuming that you need the speed. But the initial question felt a bit speed-sensitive to me.

In a past job my function at a database-oriented firm was to do everything but that; in other words, to fix the problems when the standard approach couldn't hack it (or required 4-socket Oracle servers for jobs where the budget didn't warrant such expenses). The solution/hack described above was a bit OLAPpy, and connected to hardware (an RFID chip-programming device) requiring a guaranteed response time. Two months of programming time, it still runs, and for the cost you couldn't even buy a Windows server + Oracle license.

Marco van de Voort
  • The problem with that approach is that all of the data needs to be read into memory on application startup, or at least the indexing structures need to. The indexing structures need to be calculated and written first, too. When developing all this one starts to duplicate much of the work that has been done already for database systems... – mghie Nov 25 '09 at 21:49
  • Yes, mghie identifies the main problem. Also, it would be very expensive for me to buy all my users 16 GB of RAM for their machines. :-) – lkessler Nov 26 '09 at 00:08
  • lkessler: If it is for a user app, forget it. mghie: I need more room to comment on your points; I'll do that in the post. – Marco van de Voort Nov 26 '09 at 10:53
  • @Marco: Thanks for the edit, I understand now where you're coming from and agree with most points. But of your 4 points about database assumptions at least the second is a clear match for this question, and the fourth doesn't matter here. The other two are of real concern, but while I think a tuned hand-written solution can easily outperform many databases it is doubtful that the development effort would be well spent, in the context of this question. After all it's probably only a part of the whole application. Effort should be spent only *after* it's clear (measured!) that's the bottleneck. – mghie Nov 26 '09 at 12:03
  • In this case not, I guess (though I'd have to rewind the original question to which I originally replied). I was hacking SynEdit late at night when I replied, and I was frustrated. I assume you know the feeling :-) – Marco van de Voort Nov 26 '09 at 22:10
1

Since your data is more than 3GB, you will need to make sure whatever database engine you select either handles tables that large, or you split things up into multiple tables, which I would suggest doing no matter what the maximum size of a single table is. If you perform the split, do it as evenly as possible on a logical key break so that it's easy to determine which table to use from the first one or two characters of the key. This greatly reduces the search times by eliminating any records which could never match your query to start with.
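A sketch of that kind of split, picking the partition from the first character of the key (the naming scheme here is just a placeholder):

uses
  SysUtils;   // CharInSet

// Every key maps to exactly one partition, so a lookup never has to
// touch the others.
function TableForKey(const aKey: string): string;
var
  C: Char;
begin
  Assert(aKey <> '', 'empty key');
  C := UpCase(aKey[1]);
  if CharInSet(C, ['A'..'Z']) then
    Result := 'recs_' + C         // e.g. 'SMITH1850' -> 'recs_S'
  else
    Result := 'recs_other';       // digits, punctuation, anything else
end;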

If you just want raw performance, and will only be performing read-only lookups into the data, then you're better served by ordered index file(s) using a fixed-size record for your keys which points to your data file. You can then easily perform a binary search on this data and avoid any database overhead. For even more of a performance gain, you can pre-load/cache the midpoints into memory to reduce repetitive reads.

A simple fixed size record for your specs might look like:

type
  rIndexRec = record
    KeyStr  : String[15];  // short string 15 chars max
    DataLoc : integer;     // switch to int64 if you're using gpHugeFile
  end;

For the initial load, use the TurboPower sort found in SysTools; the latest version for Delphi 2009/2010 can be downloaded from the SongBeamer website. DataLoc would be the stream position of your data-string record; writing and reading might look like the following:

function WriteDataString(const aDataString: String; aStream: TStream): integer;
var
  aLen : integer;
begin
  Result := aStream.Position;          // this offset becomes the DataLoc
  aLen := Length(aDataString);
  aStream.Write(aLen, SizeOf(aLen));   // length prefix
  if aLen > 0 then
    aStream.Write(aDataString[1], aLen * SizeOf(Char));
end;

function ReadDataString(aPos: Integer; aStream: TStream): String;
var
  aLen : integer;
begin
  if aStream.Position <> aPos then
    aStream.Seek(aPos, soFromBeginning);
  Result := '';
  aStream.Read(aLen, SizeOf(aLen));    // read the length prefix first
  SetLength(Result, aLen);
  if aLen > 0 then
    if aStream.Read(Result[1], aLen * SizeOf(Char)) <> aLen * SizeOf(Char) then
      raise Exception.Create('Unable to read entire data string');
end;

When you are creating your index records, DataLoc is set to the data-string record's stream position. It doesn't matter in which order the records are loaded, as long as the index records are sorted. I used just this technique to keep a 6 billion record database up to date with monthly updates, so it scales to the extreme easily.
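For the lookup side, a sketch of the binary search mentioned earlier, using the rIndexRec type above and assuming the index records were written contiguously in sorted key order (the helper name is mine):

// Returns True and the data location when the key is found. The sort
// order used to build the index must match the string comparison here.
function FindKey(aIndexStream: TStream; const aKey: string;
  out aDataLoc: Integer): Boolean;
var
  Lo, Hi, Mid: Integer;
  Rec: rIndexRec;
  RecKey: string;
begin
  Result := False;
  aDataLoc := 0;
  Lo := 0;
  Hi := (aIndexStream.Size div SizeOf(rIndexRec)) - 1;
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;
    aIndexStream.Position := Int64(Mid) * SizeOf(rIndexRec);
    aIndexStream.ReadBuffer(Rec, SizeOf(Rec));
    RecKey := Rec.KeyStr;               // ShortString -> string
    if RecKey = aKey then
    begin
      aDataLoc := Rec.DataLoc;
      Exit(True);
    end
    else if RecKey < aKey then
      Lo := Mid + 1
    else
      Hi := Mid - 1;
  end;
end;

Caching the midpoints just means remembering the keys hit in the first few iterations of this loop so those seeks can be skipped on later lookups.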

EDIT: Yes, the code above is limited to around 2GB per data file, but you can extend it by using gpHugeFile, or by segmenting. I prefer segmenting into multiple logical files of less than 2GB each, which also takes up slightly less disk space.

skamradt
  • Your code is limited to 2GB files until a 64 bit Delphi becomes available. – mghie Nov 25 '09 at 21:52
  • Possibly. It would require timing tests to see how fast it is and comparisons to already developed databases to verify. – lkessler Nov 26 '09 at 00:17
1

Synopse Big Table by A. Bouchez. See his answer to my other question about SQLite/DISQLite.

It wasn't even developed when I first asked this question, but now it's a quite mature and fully functional unit.

lkessler