13

I have some 'small' text files that contain about 500000 entries/rows each. Each row also has a 'key' column. I need to find these keys in a big file (8GB, at least 219 million entries). When a key is found, I need to append the 'Value' from the big file to the corresponding row of the small file, as a new column at the end of the row.

The big file looks like this:

KEY                 VALUE
"WP_000000298.1"    "abc"
"WP_000000304.1"    "xyz"
"WP_000000307.1"    "random"
"WP_000000307.1"    "text"
"WP_000000308.1"    "stuff"
"WP_000000400.1"    "stuffy"

Simply put, I need to look up 'key' in the big file.

Obviously I need to load the whole table into RAM (but this is not a problem; I have 32GB available). The big file seems to be already sorted, but I have to check this.
The problem is that I cannot do a fast lookup using something like TDictionary because, as you can see, the key is not unique.

Note: This is probably a one-time computation. I will use the program once, then throw it away. So, it doesn't have to be the BEST algorithm (difficult to implement). It just needs to finish in a decent time (like 1-2 days). PS: I prefer doing this without a DB.

I was thinking of this possible solution: TList.BinarySearch. But it seems that TList is limited to only 134,217,727 (MaxInt div 16) items, so TList won't work.


Conclusion:
I chose Arnaud Bouchez's solution. His TDynArray is impressive! I totally recommend it if you need to process large files.
Aleksey Kharlanov provided another nice solution, but TDynArray is already implemented.

Gabriel
  • 1
    Have you considered using a [MultiMap](https://en.wikipedia.org/wiki/Multimap)? – A. Sarid Sep 25 '16 at 10:25
  • 3
    What you are describing is a binary search. It's very fast; however, your data should always be sorted, so inserting an element will be expensive. You could use a tree to enable fast searches and fast inserts. – O_Z Sep 25 '16 at 10:26
  • 11
    "Note: This is a one-time computation" - You could import it into SQL, and query your data easily. – kobik Sep 25 '16 at 10:35
  • @kobik I never worked with SQL. I need to do over 3 million lookups in that 8GB file. Is SQL fast enough? – Gabriel Sep 25 '16 at 10:36
  • @O_Z - Thanks. First I will try to see if Delphi has some out-of-the-box binary search libraries, otherwise I have to write it myself. The file seems to be sorted. It is way too big to check it all manually. I need to write a small program to confirm this. – Gabriel Sep 25 '16 at 10:37
  • This could be a solution: http://stackoverflow.com/questions/8051327/how-can-i-search-a-generic-tlist-for-a-record-with-a-certain-field-value – Gabriel Sep 25 '16 at 10:39
  • 2
    @Silvester, if you are already loading the file into a data structure, you should really consider using a MultiMap instead of your dictionary. See http://stackoverflow.com/questions/8807332/delphi-multiple-indexed-generic-list/8808361 – A. Sarid Sep 25 '16 at 10:43
  • 3
    @Silvester, SQL Server is pretty fast. I'm not exactly sure what your final objective is, but having this data in a DB with an index on the "KEY" field will make your life easy – kobik Sep 25 '16 at 11:16
  • @A.Sarid - Are you referring to menjaraz's answer? – Gabriel Sep 25 '16 at 11:31
  • @Silvester yes, exactly. I guess you have a few options to check here: 1) Storing the whole file in memory using a MultiMap. 2) Using binary search to find the key you need. 3) Using SQL. – A. Sarid Sep 25 '16 at 12:23
  • 6
    "I will use the program once, then throw it away"; **how many queries** are you making against the data? If you're only making one query, you can just use `awk`. It's (relatively) slow, but it'll still complete in less time than it takes to write a program from scratch. – Roger Lipscombe Sep 25 '16 at 15:33
  • Is your incoming data already sorted? It seems so. – Arnaud Bouchez Sep 25 '16 at 16:42
  • 1
    @RogerLipscombe - I have to do 8*410000 lookups against the big file, which has at least 219 million entries. – Gabriel Sep 25 '16 at 18:10
  • @ArnaudBouchez - It seems to be sorted. I need to check this; it cannot be done 'by hand'. – Gabriel Sep 25 '16 at 18:11
  • The Q makes little sense because of contradictions. E.g.: since it is a one-time job, it doesn't really need to be fast at all. Even a naive approach can do 34G rows in, say, a two-day timeframe. – Free Consulting Sep 25 '16 at 21:24
  • 1
    I can't see any reason why you need anything beyond a hammer to crack this nut. Assuming you have two flat files, the data provided above and the list of required keys, then sort both (cmd's sort utility will be adequate), then read one record each from the files. If the required key matches the data, output a record to a new *matches* file and read the next data record. If not, then if the data-key is less than the required-key, output it to *notselected* and read the next data record. Otherwise, read the next required-key. No DB, SQL or read-to-memory required, just old-fashioned serial processing. – Magoo Sep 25 '16 at 21:27
  • Not sure why everyone keeps talking about binary searches. Does Delphi not have a multi-hashmap? – Mooing Duck Sep 25 '16 at 22:26
  • 2
    It doesn't seem that the problem is well defined. What is the expected output given the example? And what is 'key'? Since the "key is not unique", do you mean to say you wish to find the row numbers of all keys that are 'key', or do you want to find the row number of a specific set of keys like `WP_000000298.X` that has a value that is 'key' (in which case the example value is confusing)? – John B Sep 26 '16 at 01:29
  • 1
    It is not clear - what is the desired result? Show a short example, please – MBo Sep 26 '16 at 01:46
  • @MBo - I updated my question. – Gabriel Sep 26 '16 at 05:46
  • So Aleksey Kharlanov's proposal looks reasonable - you don't need data structures for the huge file. – MBo Sep 26 '16 at 06:37
  • This is something that is relatively easily done with the standard tools available on Linux without much coding, especially when you can cache it all in RAM. Similar tools exist for Windows. – Thorbjørn Ravn Andersen Sep 26 '16 at 07:34
  • @ThorbjørnRavnAndersen - Please post this as a full answer. And name the Windows tool :) – Gabriel Sep 26 '16 at 07:42

7 Answers

17

Instead of re-inventing the wheel of binary search or a B-Tree, try an existing implementation.

Feed the content into a SQLite3 in-memory DB (with the proper index, and with a transaction every 10,000 INSERTs) and you are done. Ensure you target Win64, to have enough space in RAM. You may even use file-based storage: a bit slower to create, but with indexes, queries by Key will be instant. If you do not have SQLite3 support in your edition of Delphi (via the latest FireDAC), you may use our OpenSource unit and its associated documentation.

Using SQLite3 will definitely be faster, and use fewer resources, than a regular client-server SQL database - BTW the "free" edition of MS SQL is not able to handle as much data as you need, AFAIR.
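For illustration, here is a rough sketch of that batched-insert approach using FireDAC's SQLite driver. It is only a sketch: the uses list varies by Delphi version, the two-column schema is taken from the sample data in the question, and ParseKey/ParseValue are hypothetical helpers for the real line format (the linked sample below uses the mORMot ORM layer instead).

uses
  System.SysUtils, System.Classes,
  FireDAC.Comp.Client, FireDAC.Stan.Def, FireDAC.Stan.Async, FireDAC.DApt,
  FireDAC.Phys.SQLite, FireDAC.ConsoleUI.Wait; // console app assumed

procedure BuildLookupDB(const BigFileName: string);
var
  Conn: TFDConnection;
  Ins: TFDQuery;
  Reader: TStreamReader;
  Line: string;
  i: Integer;
begin
  Conn := TFDConnection.Create(nil);
  Ins := TFDQuery.Create(nil);
  Reader := TStreamReader.Create(BigFileName);
  try
    Conn.DriverName := 'SQLite';
    Conn.Params.Values['Database'] := ':memory:'; // or a file name for on-disk storage
    Conn.Open;
    Conn.ExecSQL('CREATE TABLE entries (key TEXT, value TEXT)');
    Ins.Connection := Conn;
    Ins.SQL.Text := 'INSERT INTO entries (key, value) VALUES (:k, :v)';
    i := 0;
    Conn.StartTransaction;
    while not Reader.EndOfStream do
    begin
      Line := Reader.ReadLine;
      // ParseKey/ParseValue are hypothetical helpers for the actual line format
      Ins.ParamByName('k').AsString := ParseKey(Line);
      Ins.ParamByName('v').AsString := ParseValue(Line);
      Ins.ExecSQL;
      Inc(i);
      if i mod 10000 = 0 then // one transaction every 10,000 INSERTs
      begin
        Conn.Commit;
        Conn.StartTransaction;
      end;
    end;
    Conn.Commit;
    // create the index after the bulk insert (see the benchmark below)
    Conn.ExecSQL('CREATE INDEX idx_entries_key ON entries(key)');
  finally
    Reader.Free;
    Ins.Free;
    Conn.Free;
  end;
end;

Each lookup afterwards is then a plain SELECT value FROM entries WHERE key = :k, which the index makes almost instant.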

Update: I've written some sample code to illustrate how to use SQLite3, with our ORM layer, for your problem - see this source code file in github.

Here are some benchmark info:

  with index defined before insertion:
    INSERT 1000000 rows in 6.71s
    SELECT 1000000 rows per Key index in 1.15s

  with index created after insertion:
    INSERT 1000000 rows in 2.91s
    CREATE INDEX 1000000 in 1.28s
    SELECT 1000000 rows per Key index in 1.15s

  without the index:
    INSERT 1000000 rows in 2.94s
    SELECT 1000000 rows per Key index in 129.27s

So for a huge data set, an index is worth it, and creating the index after the data insertion reduces the resources used! Even if the insertion is slower, the gain of an index is huge when selecting per key. You may try to do the same with MS SQL, or using another ORM, and I guess you will cry. ;)

Arnaud Bouchez
  • 4
    The limit for SQL Server Express Edition is currently 10 GB per database. See https://www.microsoft.com/en-us/cloud-platform/sql-server-editions-express – Uwe Raabe Sep 25 '16 at 11:04
  • 4
    I suspect that a raw 8GB of input will end up with more than 10 GB when stored in MS SQL, including the index and all the storage overhead. – Arnaud Bouchez Sep 25 '16 at 11:07
  • 1
    "You may try to do the same with MS SQL, or using another ORM" - MS SQL is not an ORM. And ORM is a powerfull tool: and as a tool it should be used where appropriate, not as a silver bullet. – Dima Sep 25 '16 at 19:40
  • 1
    @Dima Try to do the same with MS SQL instead of SQLite3. Period. OR using another ORM, even with SQLite3 to compare performance. I know what an ORM is - I wrote one. ;) – Arnaud Bouchez Sep 25 '16 at 20:35
  • 3
    MS SQL Server is a server database. SQLite is an embedded DB. Strange that you try to compare them. It is totally possible to do everything you did with SQLite - in SQL Server. If you do not know how it is done, it does not mean it cannot be done. Period. But, yes, any DB can kill performance, if misused. – Dima Sep 25 '16 at 21:03
  • 1
    MSSQL and SQLite (and MySQL, Postgres, Oracle, etc.) are RDBMSs, not ORMs. An ORM is a mapping between objects and an RDBMS. – mmmmmm Sep 25 '16 at 22:38
  • @Mark and Dima What is your point? I never confused ORM and RDBMS. And this is not at all relevant to the question, nor the answer. My remark was a final note about performance: 1. some people did propose MS SQL for the task, and SQLite3 is better in this case. (I never meant that SQLite3 is better than MSSQL, but for this question, it is the case). 2. It is commonly argued that ORM slow down data processing, and in case of mORMot ORM, it is not the case, due to how it handles batch insertions. Everyone could do the test, and compare the numbers. – Arnaud Bouchez Sep 26 '16 at 15:04
  • Your last sentence in the question makes no sense. "MS SQL, or using another ORM" implies MSSQL is an ORM – mmmmmm Sep 26 '16 at 16:05
  • My intent: 1. Try to do the same with MSQL. OR. 2. Try using another ORM than mORMot. THEN compare numbers. 3. Cry. – Arnaud Bouchez Sep 26 '16 at 21:58
  • I find TDynArray excellent. What is the advantage of this solution (SQLite) over your TDynArray? – Gabriel Sep 27 '16 at 08:27
  • @Silvester With SQLite, you have all SQL abilities, so you can perform complex queries, if needed. And you can persist the data on disk in a better way than its original text file. See https://www.sqlite.org/appfileformat.html – Arnaud Bouchez Oct 01 '16 at 17:24
10

Another answer, since it is another solution.

Instead of using a SQLite3 database, I used our TDynArray wrapper, and its sorting and binary search methods.

type
  TEntry = record
    Key: RawUTF8;
    Value: RawUTF8;
  end;
  TEntryDynArray = array of TEntry;

const
  // used to create some fake data, with several duplicate occurrences of each Key
  COUNT = 1000000; // one million rows inserted!
  UNIQUE_KEY = 1024; // should be a power of two

procedure Process;

var
  entry: TEntryDynArray;
  entrycount: integer;
  entries: TDynArray;

  procedure DoInsert;
  var i: integer;
      rec: TEntry;
  begin
    for i := 0 to COUNT-1 do begin
      // here we fill with some data
      rec.Key := FormatUTF8('KEY%',[i and pred(UNIQUE_KEY)]);
      rec.Value := FormatUTF8('VALUE%',[i]);
      entries.Add(rec);
    end;
  end;

  procedure DoSelect;
  var i,j, first,last, total: integer;
      key: RawUTF8;
  begin
    total := 0;
    for i := 0 to pred(UNIQUE_KEY) do begin
      key := FormatUTF8('KEY%',[i]);
      assert(entries.FindAllSorted(key,first,last));
      for j := first to last do
        assert(entry[j].Key=key);
      inc(total,last-first+1);
    end;
    assert(total=COUNT);
  end;

begin
  // wrap the dynamic array: external count, RawUTF8 first field used for sort/search
  entries.InitSpecific(TypeInfo(TEntryDynArray), entry, djRawUTF8, @entrycount);
  DoInsert;
  entries.Sort;   // sort once by Key, as required by FindAllSorted
  DoSelect;
end;

Here are the timing results:

one million rows benchmark:
INSERT 1000000 rows in 215.49ms
SORT ARRAY 1000000 in 192.64ms
SELECT 1000000 rows per Key index in 26.15ms

ten million rows benchmark:
INSERT 10000000 rows in 2.10s
SORT ARRAY 10000000 in 3.06s
SELECT 10000000 rows per Key index in 357.72ms

It is more than 10 times faster than the SQLite3 in-memory solution. The 10 million rows stay in the memory of the Win32 process with no problem.

And it is a good sample of how the TDynArray wrapper works in practice, and of how its SSE4.2-optimized string comparison functions give good results.

Full source code is available in our github repository.

Edit: with 100,000,000 rows (100 million rows), under Win64, using more than 10GB of RAM during the process:

INSERT 100000000 rows in 27.36s
SORT ARRAY 100000000 in 43.14s
SELECT 100000000 rows per Key index in 4.14s
Arnaud Bouchez
  • I made a test with our TDynArray wrapper, and 100,000,000 entries, using 10GB of RAM under Win64 - I can't test 230,000,000 since I only have 16GB of RAM. And it worked with no problem. – Arnaud Bouchez Sep 25 '16 at 13:54
  • I calculated that I have about 219 MILLION entries/rows in the big (8GB) file. Can TDynArray handle this? I see that the time for 100 mil entries is good. I guess it could also handle 219 mil. – Gabriel Sep 25 '16 at 18:02
  • I don't like the DB solution (I don't say it is not good, I just don't like DB). But I like this one. Many thanks! I will try it. – Gabriel Sep 25 '16 at 18:05
  • @Silvester Since I had no problem with 100 MILLION rows on my 16GB PC, I think you will have no problem with 219 million on yours. But ensure you use AnsiString (like our RawUTF8) and not plain string=UnicodeString for the fields, otherwise you will need double the RAM, for no benefit. – Arnaud Bouchez Sep 25 '16 at 18:41
  • @arnaud - If I want Delphi AnsiStrings, do I need to use djWinAnsi in InitSpecific? – Gabriel Sep 26 '16 at 08:56
  • I tried it on a small set (MB) and it works. I will try it on the 8GB file today. Accepted. Many thanks. – Gabriel Sep 26 '16 at 09:43
  • djWinAnsi is for String(1252) = WinAnsiString in SynCommons.pas - but for this exact purpose djWinAnsi or djRawUTF8 won't make any difference, since sorting will be case insensitive by default. – Arnaud Bouchez Sep 26 '16 at 14:55
  • I have seen that loading a 700MB file into TDynArray takes about 1500MB of RAM (using djRawUTF8). It seems that it requires 2 bytes to store an Ansi char. This means that I can only load (on a 32GB computer) 16GB files. I don't know TDynArray's internal workings, so I'm just asking: is it possible to decrease memory usage? – Gabriel Sep 27 '16 at 08:18
  • @Silvester No, a RawUTF8 stores one Ansi char in one byte. The overhead is in the string header (length+refcount+codepage before the actual text). What you can do, if your keys are sorted, is to re-use the same previous value instance, so you will use the same pointer, with a ref-count+1. In short: do not create new strings (using e.g. copy) if the values are identical. See https://www.delphitools.info/2013/05/13/immutable-strings-in-delphi/ – Arnaud Bouchez Oct 01 '16 at 17:21
  • PS: I think I have found some possible issues when the code is compiled for 64 bit. You use integer instead of lparam: SendMessage( HWND_BROADCAST, WM_WININICHANGE, 0, integer(@Device)). Details: ms-help://embarcadero.rs_xe7/rad/Converting_32-bit_Delphi_Applications_to_64-bit_Windows.html – Gabriel Oct 13 '16 at 10:10
7

Since this is a one-time task, the fastest way is to load the whole file into memory, scan it line by line, parse the key, compare it with the search key(s), and print (save) the found positions.

UPD: If you have a sorted list in the source file, and assuming you have 411000 keys to look up, you can use this trick: sort your search keys in the same order as the source file. Read the first key from both lists and compare them. If they differ, read the next key from the source until they are equal. Save the position; if the next key in the source is also equal, save it too, etc. If the next key differs, read the next key from the search-keys list. Continue until EOF.
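A minimal sketch of that merge-style walk over two already-sorted, already-loaded arrays (the record type, the parameter names and the Writeln output are illustrative assumptions, not part of the answer):

type
  TBigEntry = record
    Key, Value: string;
  end;

// One linear pass over two sorted inputs (a "merge join").
// small: keys from a small file, sorted ascending
// big:   key/value pairs from the big file, sorted ascending by Key
procedure MergeLookup(const small: array of string; const big: array of TBigEntry);
var
  i, j, k: Integer;
begin
  i := 0;
  j := 0;
  while (i <= High(small)) and (j <= High(big)) do
    if small[i] < big[j].Key then
      Inc(i)                      // key not present in the big file: skip it
    else if small[i] > big[j].Key then
      Inc(j)                      // big-file key not requested: skip it
    else
    begin
      // equal: report the whole run of duplicate keys in the big file
      k := j;
      while (k <= High(big)) and (big[k].Key = small[i]) do
      begin
        Writeln(small[i], #9, big[k].Value);  // or append to the small file's row
        Inc(k);
      end;
      Inc(i);  // leave j at the start of the run, in case the small
               // file also contains duplicates of this key
    end;
end;

Both inputs are traversed exactly once, so after the initial sorts the whole job is a single linear pass over the 219 million rows plus one over the search keys.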

  • 4
    If you scan it line by line, there is no need to load it all into memory. Just read it line by line. File memory used by buffering: 4 KB. – Zan Lynx Sep 25 '16 at 16:37
  • This is 'brute force'. I have 411000 keys to look up in the big (8GB) file. And there are over 219 million entries in the big file. So 411000 * 219000000 = a huge number. With 'brute force' it will take weeks (months?) to finish. – Gabriel Sep 25 '16 at 17:58
  • "read it line by line" - The code that will interpret the line will add a huge overhead. The lines of text needs to be loaded in some kind of record just to eliminate this overhead. – Gabriel Sep 25 '16 at 18:00
  • You did not mention in the question how many keys you need to look up. So of course, for 411000 keys it is not a good solution. – Aleksey Kharlanov Sep 25 '16 at 18:44
  • 1
    You can build a dictionary for every small file, not for the huge one, and look up keys from the huge file's lines in that dictionary. – MBo Sep 26 '16 at 06:36
  • Actually this idea is not bad at all!! If both files are sorted, I need to find the position of the first match between the small file and the big file, then I ONLY do the search from that position down. – Gabriel Sep 26 '16 at 07:05
  • If the big file is ordered and the small files combined fit into memory, you don't even literally _need_ to go as fancy as keeping a priority queue for which "small key" to handle next or sort their records augmented by a file indicator. – greybeard Sep 26 '16 at 07:28
  • @greybeard - Can you elaborate? If I understand correctly, you mean I should simply use the small file unsorted? And do the sort as I go through it? Because I HAVE TO pick up the keys from the small file in order (the same order used for sorting the big file). Am I missing something? – Gabriel Sep 26 '16 at 07:38
  • @Silvester: I suggest processing all of the small files concurrently, and sketched options to do so. The priority queue would allow to select between n smallest keys from each of n small files (sorted individually) with less than n-1 comparisons. "Globally" sorting the records from the small files would eliminate even that. – greybeard Sep 26 '16 at 08:50
  • 1
    OP might look up an implementation of [Merge Join](https://en.wikipedia.org/wiki/Sort-merge_join). – Lieven Keersmaekers Sep 26 '16 at 09:06
  • @AlekseyKharlanov - This is definitely a nice solution (initially accepted), but I would have to implement it. So, I am going to accept Arnaud's solution since it is already implemented. MANY THANKS anyway! – Gabriel Sep 26 '16 at 09:42
3

Use memory-mapped files. Just think of your file as already read into memory in its entirety and do that very binary search in memory that you wanted. Let Windows take care of reading the portions of the file when you do your in-memory search.

You may take any of these sources as a start; just do not forget to update them for Win64:

http://torry.net/quicksearchd.php?String=memory+mapped+files&Title=No
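For illustration, a minimal sketch of mapping the whole file via the Windows API (a 64-bit build is assumed so that an 8GB view fits in the address space; the caller keeps the handles and unmaps/closes them when done):

uses Winapi.Windows, System.SysUtils;

// Map the big file read-only and return a flat pointer plus its size.
procedure MapBigFile(const FileName: string; out Data: PAnsiChar; out Size: Int64);
var
  hFile, hMap: THandle;
  sizeLow, sizeHigh: DWORD;
begin
  hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  sizeLow := GetFileSize(hFile, @sizeHigh);
  Size := (Int64(sizeHigh) shl 32) or sizeLow;
  hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
  if hMap = 0 then
    RaiseLastOSError;
  Data := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0); // 0 = map the entire file
  if Data = nil then
    RaiseLastOSError;
  // Data[0]..Data[Size-1] now reads like the whole file loaded into memory,
  // so the binary search over the sorted lines can work directly on this buffer.
end;

Windows pages the 8GB in on demand, so only the portions actually touched by the search are read from disk.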

Arioch 'The
  • Hi Arioch. What I need is a lookup algorithm, not a file management algorithm (unless the memory-mapped file class already implements some kind of hash search). I think it is OK if I load the whole file in memory. It is only 8GB. I have 32GB available. :) – Gabriel Sep 26 '16 at 06:55
  • @Silvester "The big file seems to be already sorted. I have to check this." - did you check ? if it is then you can just do a binary search. I'd would add a "second level" milestones - would find the keys every 250MB and turn them into a 2nd-tier list - 8*4 = 32 items, but they would narrow your search down to 250MB chunks. Then I would indeed sort small files and do a binary search over every item - but that search would be limited by both that second-tier list and previous key, so would probably narrow down fast. ...or I would use SQL :-) – Arioch 'The Sep 26 '16 at 10:34
1

A method that needs the file to be sorted, but avoids data structures entirely:

You only care about one line, so why even read the bulk of the file?

Open the file and move the "get pointer" (apologies for talking C) halfway through the file. You'll have to figure out if you're at a number or a word, but a number should be close by. Once you know the closest number, you know if it's higher or lower than what you want, and continue with the binary search.
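A rough sketch of one such probe (a helper I'm assuming here, not code from the answer): seek to an arbitrary offset, skip the partial line you landed in, and return the next complete line, whose key the binary search then compares against the wanted key.

uses System.Classes, System.SysUtils;

// Assumes every line fits well within the 4 KB buffer; an offset of 0 would
// need special-casing so that the very first line is not skipped.
function ReadLineAt(fs: TFileStream; pos: Int64): AnsiString;
var
  buf: array[0..4095] of AnsiChar;
  n, i, start: Integer;
begin
  Result := '';
  fs.Position := pos;
  n := fs.Read(buf, SizeOf(buf));
  i := 0;
  while (i < n) and (buf[i] <> #10) do
    Inc(i);                        // skip to the end of the partial line
  Inc(i);                          // step over the line feed
  start := i;
  while (i < n) and not (buf[i] in [#13, #10]) do
    Inc(i);                        // collect the next full line
  if start < n then
    SetString(Result, PAnsiChar(@buf[start]), i - start);
end;

Each lookup then needs roughly log2(8GB) ≈ 33 such probes, and the OS file cache makes repeated probes cheap.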

Carl
  • 1
    Instead of "figuring out if you're at a number or a word", why not just read a buffer of about 100 characters and look for the next newline? Then look for the next newline after that, and you now have a whole line you can look at and parse. (I say "about 100" because based on the data posted, reading 100 characters should be enough to include an entire line between two newlines.) – ajb Sep 26 '16 at 03:57
  • I have to do 411000 * 219000000 lookups. This will finish in less than 1 month ONLY if the data is in memory. – Gabriel Sep 26 '16 at 06:51
  • 1
    (The question has been edited a number of times, including a revision claiming about 3.3 million (8*410000) keys to look up.) – greybeard Sep 26 '16 at 07:24
  • @ajb That is what I mean by figuring it out. – Carl Sep 26 '16 at 13:16
0

Idea based on Aleksey Kharlanov's answer. I accepted his answer.
I only copy his idea here because he did not elaborate on it (no pseudo-code or deeper analysis of the algorithm). I want to confirm it works before implementing it.

We sort both files (once).
We load the Big file into memory (once).
We read the Small file line by line from disk (once).

Code:
In the code below, sKey is the current key in the Small file and bKey is the current key in the Big file:

LastPos := 0;
for sKey in SmallFile do
  for CurPos := LastPos to BigFile.Count-1 do
  begin
    bKey := BigFile[CurPos].Key;
    if sKey = bKey then
    begin
      SearchNext;        // search (down) the next entries for possible duplicate keys
      LastPos := CurPos;
    end
    else if sKey < bKey then
      Break;             // sKey cannot appear further down; move on to the next sKey
  end;

It works because I know the position (in the Big file) where the last key was found. The next key can only be somewhere BELOW that position; ON AVERAGE it should be within the next 440 entries (roughly 219 million / 500,000). However, I don't even have to always read 440 entries below LastPos, because if my sKey does not exist in the big file, it will be smaller than the bKey, so I quickly break the inner loop and move on.

Thoughts?

Gabriel
0

If I were doing this as a one-time thing, I'd create a set with all the keys I need to look up. Then read the file line by line, check to see if the key exists in the set, and output the value if so.

Briefly, the algorithm is:

mySet = dictionary of keys to look up
for each line in the file
    key = parse key from line
    if key in mySet
        output key and value
end for

Since Delphi doesn't have a generic set, I'd use TDictionary and ignore the value.
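A minimal Delphi sketch of that algorithm; the two-column, whitespace-separated, quoted line format is assumed from the sample in the question, and joining the values back into each small file's rows is left out:

uses
  System.SysUtils, System.Classes, System.IOUtils, System.Generics.Collections;

// Returns the Index-th whitespace-separated column of a line, unquoted.
// (Adjust for the real file layout.)
function FieldOf(const Line: string; Index: Integer): string;
var
  parts: TArray<string>;
begin
  parts := Line.Split([' ', #9], TStringSplitOptions.ExcludeEmpty);
  Result := parts[Index].DeQuotedString('"');
end;

procedure Lookup(const SmallFile, BigFile, OutFile: string);
var
  wanted: TDictionary<string, Boolean>;  // used as a set: the value is ignored
  reader: TStreamReader;
  writer: TStreamWriter;
  line, key: string;
begin
  wanted := TDictionary<string, Boolean>.Create;
  reader := TStreamReader.Create(BigFile);
  writer := TStreamWriter.Create(OutFile);
  try
    // 1. Load the ~500,000 keys of the small file into the "set"
    //    (duplicate keys simply collapse into one entry).
    for line in TFile.ReadAllLines(SmallFile) do
      wanted.AddOrSetValue(FieldOf(line, 0), True);
    // 2. Stream the 8GB file once; each key lookup is O(1).
    while not reader.EndOfStream do
    begin
      line := reader.ReadLine;
      key := FieldOf(line, 0);
      if wanted.ContainsKey(key) then
        writer.WriteLine(key + #9 + FieldOf(line, 1));
    end;
  finally
    writer.Free;
    reader.Free;
    wanted.Free;
  end;
end;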

The dictionary lookup is O(1), so it should be very fast. Your limiting factor will be file I/O time.

I figure that'd take about 10 minutes to code up, and less than 10 minutes to run.

Jim Mischel
  • TDictionary won't work. I already explained why. Please see the bold text in my Question. – Gabriel Sep 28 '16 at 07:04
  • @Silvester: But TDictionary should work in this case because you're using it to store the keys you want to output. My assumption is that if `"WP_000000307.1"` occurs in the small file, you want the values from all the lines in the big file that have that key. I think if you re-read my answer, you'll see that it's using TDictionary to good effect here. – Jim Mischel Sep 28 '16 at 11:31
  • When you say 'each line in the file', do you mean the 'small file'? (In my question, I always referred to my files as the 'small' and the 'big' file.) – Gabriel Sep 28 '16 at 11:34
  • Please note that keys are not unique in the 'big file', nor in the 'small file'. – Gabriel Sep 28 '16 at 11:35