
Assume I've got four large files (too large to bring into memory, even individually) that have information I need to process. I intend to produce a single application-level object (Record) from each line in file #1. Files 2-4 each hold additional pieces of information required to compose this Record object. For example, the file structure may be as follows:

File #1:
key, description

File #2:
key, metadata, size

File #3:
origin, rate, key

File #4:
key, startDate, endDate

Each file has a single column (at a known position within a line) that represents a unique key. This key is shared across files, but there is no guarantee that a key that exists in one file exists in the others, so we'll only process the subset of keys that exist in all four. The rows of the files are not sorted. Can you devise an algorithm to produce the app-level objects by processing these files?
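For concreteness, the composed object might look like this (a hypothetical sketch; the field names are taken from the example columns above):

    from dataclasses import dataclass

    @dataclass
    class Record:
        key: str
        description: str   # from file #1
        metadata: str      # from file #2
        size: str          # from file #2
        origin: str        # from file #3
        rate: str          # from file #3
        startDate: str     # from file #4
        endDate: str       # from file #4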

djunforgetable

2 Answers


Using a key-value store database

Databases are the best tools for handling datasets larger than memory. Load your files into a key-value store (a NoSQL DB such as CouchDB or Cassandra would work well), then solve your problem with key queries.
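A minimal sketch of the idea in Python, using the stdlib shelve module as a stand-in for a real key-value store (the file names, comma delimiter, and column positions are assumptions taken from the question's example):

    import shelve

    # (path, key column) for files 2-4; file #3 keeps its key in the third column
    SIDE_FILES = [("file2.txt", 0), ("file3.txt", 2), ("file4.txt", 0)]

    # pass 1: load each side file into an on-disk key-value store
    stores = []
    for i, (path, key_col) in enumerate(SIDE_FILES):
        store = shelve.open(f"store{i}")
        with open(path) as f:
            for line in f:
                fields = [s.strip() for s in line.split(",")]
                store[fields[key_col]] = fields
        stores.append(store)

    # pass 2: stream file #1 and complete each record with key lookups
    with open("file1.txt") as f:
        for line in f:
            fields = [s.strip() for s in line.split(",")]
            key = fields[0]
            if all(key in store for store in stores):  # key must exist in all files
                extras = [store[key] for store in stores]
                record = (fields, *extras)  # compose the Record object here

    for store in stores:
        store.close()

With a real key-value store the structure is the same: one bulk-load pass per side file, then one streaming pass over file #1 with three lookups per line.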

Using sort and binary search

If you can't use a database, sort each file on its key column (this is easily done with GNU sort). Sorting costs O(n log n) once; after that you can look up any key in O(log n) with a binary search over the sorted file. Iterate over the largest file, and complete each record with lookups into the other files. Accessed this way, your disk reads are likely to be cached.
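A sketch of the lookup side in Python, assuming comma-separated files sorted byte-wise on the key column (e.g. LC_ALL=C sort -t, -k1,1 file2.txt -o file2.sorted; file #3 would use -k3,3). Each probe seeks to a byte offset, realigns to the next line boundary, and compares keys as bytes:

    def lookup(f, key, key_col=0, sep=b","):
        """Binary-search an open, key-sorted file (binary mode) for `key` (bytes)."""
        def line_at(offset):
            # first complete line starting at or after `offset`
            if offset == 0:
                f.seek(0)
            else:
                f.seek(offset - 1)
                f.readline()          # realign to the next line boundary
            return f.readline()

        def key_of(line):
            return line.split(sep)[key_col].strip()

        f.seek(0, 2)                  # seek to the end to learn the file size
        lo, hi = 0, f.tell()
        while lo < hi:                # find the leftmost offset that could match
            mid = (lo + hi) // 2
            line = line_at(mid)
            if line and key_of(line) < key:
                lo = mid + 1
            else:
                hi = mid
        line = line_at(lo)
        return line if line and key_of(line) == key else None

    with open("file2.sorted", "rb") as f:
        row = lookup(f, b"some-key")  # one O(log n) lookup per key

Because the driving file is also sorted, successive lookups walk the other files in key order, so most probes hit blocks already in the OS page cache.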

Adam Matan
  • Hmm, I was hoping it wouldn't come to this. And what if the key that's in the file is not the same key I use to store these Record objects in my NoSQL solution? Assume I use another unique key (a UUID) and the "key" in these files is only for recomposing these objects? – djunforgetable Aug 11 '11 at 14:06
  • You can define the key string used in your NoSQL DB. – Adam Matan Aug 11 '11 at 14:12
  • Just use the key string that is most convenient for you, or use multiple indices on the DB. On another note, while NoSQL stores are certainly nice, I don't see any benefit over a plain SQL database in this particular application. – averell Aug 11 '11 at 14:21
  • A key-to-value query might be way faster in NoSQL: http://stackoverflow.com/questions/1145726/what-is-nosql-how-does-it-work-and-what-benefits-does-it-provide/1145751#1145751 – Adam Matan Aug 11 '11 at 19:16

You could dump everything into a database (actually, a plain SQL one would be fine), and then delete the "incomplete" records.
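A minimal sketch of the database route in Python with the stdlib sqlite3 module (the file, table, and column names are assumptions from the question's example). Selecting with a four-way inner join keeps only the keys present in all files, which has the same effect as deleting the "incomplete" records:

    import csv
    import sqlite3

    conn = sqlite3.connect("records.db")
    conn.executescript("""
        CREATE TABLE f1 (key TEXT PRIMARY KEY, description TEXT);
        CREATE TABLE f2 (key TEXT PRIMARY KEY, metadata TEXT, size TEXT);
        CREATE TABLE f3 (key TEXT PRIMARY KEY, origin TEXT, rate TEXT);
        CREATE TABLE f4 (key TEXT PRIMARY KEY, startDate TEXT, endDate TEXT);
    """)

    def load(path, sql, reorder=lambda row: row):
        # executemany streams the rows, so a file never has to fit in memory
        with open(path, newline="") as f:
            rows = (reorder(r) for r in csv.reader(f, skipinitialspace=True))
            conn.executemany(sql, rows)

    load("file1.txt", "INSERT INTO f1 VALUES (?, ?)")
    load("file2.txt", "INSERT INTO f2 VALUES (?, ?, ?)")
    # file #3 stores its key last; reorder each row to (key, origin, rate)
    load("file3.txt", "INSERT INTO f3 VALUES (?, ?, ?)", lambda r: (r[2], r[0], r[1]))
    load("file4.txt", "INSERT INTO f4 VALUES (?, ?, ?)")
    conn.commit()

    # the inner join yields one row per key that exists in all four tables
    for row in conn.execute("""
        SELECT f1.key, f1.description, f2.metadata, f2.size,
               f3.origin, f3.rate, f4.startDate, f4.endDate
        FROM f1 JOIN f2 USING (key) JOIN f3 USING (key) JOIN f4 USING (key)
    """):
        record = row  # build the application-level Record object here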

To do it purely with files, you can do this (a runnable sketch follows the list):

  • Sort all files by the key column
  • open all sorted files
  • read the first record from each file
  • if you don't have 4 "matching" records, discard the record with the lowest key until you do
  • merge the 4 "matching" records
  • rinse and repeat
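A minimal Python sketch of this merge, assuming the four files are comma-separated as in the question and have been key-sorted byte-wise (e.g. with LC_ALL=C sort); only one record per file is held in memory at a time:

    import csv

    # (path, key column) per sorted file; file #3 keeps its key in the third column
    SORTED_FILES = [("file1.sorted", 0), ("file2.sorted", 0),
                    ("file3.sorted", 2), ("file4.sorted", 0)]

    def records(path, key_col):
        # yield (key, row) pairs from a key-sorted, comma-separated file
        with open(path, newline="") as f:
            for row in csv.reader(f, skipinitialspace=True):
                yield row[key_col], row

    def merged(files):
        streams = [records(path, col) for path, col in files]
        current = [next(s, None) for s in streams]
        while all(current):                        # stop once any file runs out
            low = min(key for key, _ in current)
            if all(key == low for key, _ in current):
                yield [row for _, row in current]  # four matching records
                current = [next(s, None) for s in streams]
            else:
                # discard the records sitting on the lowest key; they can't match
                current = [next(streams[i], None) if key == low else (key, row)
                           for i, (key, row) in enumerate(current)]

    for rows in merged(SORTED_FILES):
        record = rows  # compose the Record object from the four matching rows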
averell