50

Points:

  • We process thousands of flat files in a day, concurrently.
  • The memory constraint is a major issue.
  • We use a thread for each file we process.
  • We don't sort by columns; each line (record) in the file is treated as a single key.

Can't Do:

  • We cannot use unix/linux's sort commands.
  • We cannot use any database system no matter how light they can be.

Now, we cannot just load everything into a collection and use its sort mechanism: it will eat up all the memory and the program will fail with a heap error.

In that situation, how would you sort the records/lines in a file?

Erika Gomez
  • 4
    Is there a reason you can't use a database system? DBs are made for scenarios like this because they are so efficient at sorting through large amounts of data. – keyboardP Jan 18 '10 at 16:24
  • Yes, the system has been up for 10 years and they are not willing to change it. – Erika Gomez Jan 18 '10 at 16:25
  • 11
    @Erika: how is introducing a light-weight not-installed database different from introducing a custom-written program that does whatever you want but is not as well-tested? Both are "changing the system", technically. – Joachim Sauer Jan 18 '10 at 16:33
  • 8
    Try feeding this to an executive who doesn't know sh!t about programming. If you can sell it to him, you are my mentor! – Erika Gomez Jan 18 '10 at 16:38
  • 6
    Email him the link to this thread :D – keyboardP Jan 18 '10 at 17:32
  • 4
    if it's an embedded DB, the executive does not have to know about it – Valentin Rocher Jan 19 '10 at 08:51

15 Answers

57

It looks like what you are looking for is external sorting.

Basically, you sort small chunks of the data first, write each sorted chunk back to disk, and then merge those chunks to produce the fully sorted output.
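
As a purely illustrative sketch of the first phase in Java (the class and method names, the chunk size, and the assumption that lines are plain text compared lexicographically are mine, not from the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RunGenerator {

    // Reads the input in chunks of at most maxLines lines, sorts each chunk in
    // memory and writes it to its own temporary "run" file. Each run file is
    // sorted on its own and can later be merged into the final output.
    static List<Path> writeSortedRuns(Path input, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>(maxLines);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == maxLines) {
                    runs.add(flushSortedChunk(chunk));
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(flushSortedChunk(chunk));
            }
        }
        return runs;
    }

    // Sorts one chunk in memory, writes it to a temp file and clears the chunk.
    private static Path flushSortedChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("sorted-run-", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        chunk.clear();
        return run;
    }
}

The second phase then reads the runs back and repeatedly emits the smallest current line, as described in the other answers.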

Anuj Garg
phisch
  • 4
    From my research, what I understood is that if you have 1000 records in a file and you read 100 at a time, sort that 100 and put the sorted version in a temp file which will create 10 temp sorted files. Then read two files sequentially and create another sorted (larger now) file and delete the other two which just have been read. Continue until you have one file. SERIOUSLY? Now, say you have 10 million records in a file and you read 5000 at a time, how many temp you created and how much time it's gonna cost you to get the final version? – Erika Gomez Jan 18 '10 at 16:45
  • An external sort is always slower compared to an in-memory sort, but you are no longer limited by your RAM. If speed is important to you and you have several machines at hand, take a look at Hadoop (mentioned in other replies). It does an external sort where all individual sort operations can happen on many machines in parallel. – phisch Jan 18 '10 at 16:55
  • Erika: When you merge the (sorted, smaller) files, you can have more than two open, it's just slightly more straightforward describing the algorithm using just two temp files. But, if you need a file that's larger than the available memory sorted, you'll have to (eventually) do it that way anyway and the merging operation is (relatively) fast, as all it needs to do is keep N file pointers open and find the lowest of N "next record"s to know what to emit next. I guess the critical piece of tuning is choosing how many records to keep in each temporary file. – Vatine Jan 18 '10 at 17:14
15

As others mentioned, you can process this in steps.
I would like to explain it in my own words (differing on point 3):

  1. Read the file sequentially, process N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).

  2. Sort the N records in memory and write them to a temp file. Repeat until you have produced all T temp files.

  3. Open all T temp files at the same time, but read only one record per file (with buffers, of course). For each of these T records, find the smallest, write it to the final file, and advance only in that file. (A sketch of this merge step follows the numbered example below.)


Advantages:

  • The memory consumption is as low as you want.
  • You only do double the disk accesses compared to an everything-in-memory policy. Not bad! :-)

Example with numbers:

  1. Original file with 1 million records.
  2. Choose to have 100 temp files, so read and sort 10,000 records at a time, and write each batch to its own temp file.
  3. Open the 100 temp files at once and read the first record of each into memory.
  4. Compare these records, write the smallest to the final file, and advance only that temp file.
  5. Repeat step 4, one million times.
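
To make step 3 concrete, here is a hedged Java sketch of the T-way merge, using a PriorityQueue keyed on each temp file's current record (it assumes the temp files hold plain text lines in natural String order; all class and method names are illustrative):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // One open temp file plus the record that has been read but not yet written.
    private static class Cursor {
        final BufferedReader reader;
        String current;

        Cursor(Path run) throws IOException {
            reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
            current = reader.readLine();
        }
    }

    // Merges T sorted temp files into one sorted output file while holding
    // only one record per temp file in memory.
    static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current));
        for (Path run : runs) {
            Cursor c = new Cursor(run);
            if (c.current != null) {
                heap.add(c);
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();      // temp file holding the smallest record
                out.write(smallest.current);
                out.newLine();
                smallest.current = smallest.reader.readLine();
                if (smallest.current != null) {
                    heap.add(smallest);             // re-insert it with its next record
                } else {
                    smallest.reader.close();        // this temp file is exhausted
                }
            }
        }
    }
}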

EDITED

You mentioned a multi-threaded application, so I wonder ...

As we have seen in this discussion, using less memory means less performance, dramatically so in this case. So I would also suggest using only one thread to run one sort at a time, rather than running it as a multi-threaded application.

If you run ten threads, each with a tenth of the memory available, your performance will be miserable, much worse than a tenth of the single-threaded time. If you use only one thread, queue the 9 other requests, and process them in turn, your overall throughput will be much better and you will finish the ten tasks much faster.


After reading this response: Sort a file with huge volume of data given memory constraint, I suggest you consider a distribution sort. It could be a huge gain in your context.

The improvement over my proposal is that you don't need to open all the temp files at once; you only open one of them at a time. It saves your day! :-)

KLE
  • @Erika Well, it is an example, so that we get the idea. There is a choice to be made between the temp file size, and the number of them. – KLE Jan 19 '10 at 08:34
14

In spite of your restriction, I would use the embedded database SQLite3. Like you, I work weekly with 10-15 million flat-file lines, and it is very, very fast to import and generate sorted data, and you only need a small, free-of-charge executable (sqlite3.exe). For example: once you download the .exe file, in a command prompt you can do this:

C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' TabLines

then:

sqlite> select * from tabLines order by line;

or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout
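
If the export has to be driven from the Java program itself, something along these lines could shell out to the same executable (a rough sketch that assumes sqlite3 is on the PATH and the table has already been created and imported as above; file and table names are just the ones from the example):

import java.io.File;
import java.io.IOException;

public class SqliteSortRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Ask the sqlite3 command-line tool to emit the table in sorted order
        // and redirect its standard output straight into a file.
        ProcessBuilder pb = new ProcessBuilder(
                "sqlite3", "dbLines.db", "select line from tabLines order by line;");
        pb.redirectOutput(new File("out.txt"));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sqlite3 exited with code " + exitCode);
        }
    }
}
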
Narendra Pathai
Eduardo
13

You can read the files in smaller parts, sort these and write them to temporary files. Then you read two of them sequentially again and merge them into a bigger temporary file, and so on. If there is only one left, you have your sorted file. Basically that's the merge sort algorithm performed on external files. It scales quite well with arbitrarily large files but causes some extra file I/O.

Edit: If you have some knowledge about the likely variance of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified: you would read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can repeat the same process for it based on the 2nd char of the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the merge sort, whose performance is stable.
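
A rough Java sketch of the simplified (non-recursive) version: one pass that buckets lines by first character, then a second pass that sorts each bucket in memory and appends it to the output. It assumes every bucket fits in memory and that natural String ordering is wanted; the bucket file names are made up and the recursive fallback for oversized buckets is omitted.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DistributionSort {

    static void sort(Path input, Path output) throws IOException {
        // Pass 1: append every line to a bucket file chosen by its first character.
        Map<Character, BufferedWriter> buckets = new TreeMap<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                char key = line.isEmpty() ? '\0' : line.charAt(0);
                BufferedWriter w = buckets.get(key);
                if (w == null) {
                    w = Files.newBufferedWriter(bucketPath(key), StandardCharsets.UTF_8);
                    buckets.put(key, w);
                }
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : buckets.values()) {
            w.close();
        }

        // Pass 2: in ascending key order, sort each (small) bucket in memory
        // and append it directly to the output file.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (char key : buckets.keySet()) {
                List<String> lines = Files.readAllLines(bucketPath(key), StandardCharsets.UTF_8);
                Collections.sort(lines);            // assumed to fit in memory
                for (String l : lines) {
                    out.write(l);
                    out.newLine();
                }
                Files.delete(bucketPath(key));
            }
        }
    }

    private static Path bucketPath(char key) {
        return Paths.get("bucket-" + (int) key + ".txt");
    }
}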

x4u
  • From my research, what I understood is that if you have 1000 records in a file and you read 100 at a time, sort that 100 and put the sorted version in a temp file which will create 10 temp sorted files. Then read two files sequentially and create another sorted (larger now) file and delete the other two which just have been read. Continue until you have one file. SERIOUSLY? Now, say you have 10 million records in a file and you read 5000 at a time, how many temp you created and how much time it's gonna cost you to get the final version? – Erika Gomez Jan 18 '10 at 16:43
  • You do the merging by taking the two smallest temporary files and merge them to one larger temporary file. This causes log2(n) times more file I/O operations than sorting everything in memory (n is the number of temporary files you started with). So for 8 parts at the beginning this will be 300% I/O overhead, for 128 parts it will be 700%. – x4u Jan 18 '10 at 17:11
8

I would spin up an EC2 cluster and run Hadoop's MergeSort.

Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.

Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e. the divide-and-conquer strategy). Here is its website.

As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.

danben
3

You could use the following divide-and-conquer strategy:

Create a function H() that can assign each record in the input file a number. For a record r2 that will be sorted behind a record r1 it must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that will fit into memory so you can sort them. Once you have done that you can just concatenate the sorted files to get one large sorted file.

Suppose you have this input file where each line represents a record

Alan Smith
Jon Doe
Bill Murray
Johnny Cash

Let's just build H() so that it uses the first letter of the record. You might get up to 26 files, but in this example you will just get 3:

<file1>
Alan Smith

<file2>
Bill Murray

<file10>
Jon Doe
Johnny Cash

Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. Now, if you just concatenate the 3 files, you'll have a sorted version of the input.

Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in a way that the resulting parts you need to sort don't overlap, which makes merging the result much simpler.

The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
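
As a purely illustrative sketch of what H() could look like for the example above (it assumes every record starts with an ASCII letter and reuses the <file1>/<file2>/<file10> numbering):

public class FirstLetterPartitioner {

    // A possible H(): assigns each record a bucket number such that every record
    // in bucket i sorts before every record in bucket i + 1. Here the bucket is
    // simply the position of the first letter in the alphabet.
    static int H(String record) {
        char first = Character.toUpperCase(record.charAt(0));
        return (first - 'A') + 1;   // "Alan Smith" -> 1, "Bill Murray" -> 2, "Jon Doe" -> 10
    }

    public static void main(String[] args) {
        String[] records = {"Alan Smith", "Jon Doe", "Bill Murray", "Johnny Cash"};
        for (String r : records) {
            System.out.println(r + " -> file" + H(r));
        }
    }
}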

VoidPointer
  • 1
    I know this is an old answer but concatenating the 3 sorted files will not always lead to a sorted version. Please, readers, do not consider this a valid answer. – Adel Ben Hamadi Nov 24 '20 at 15:12
2

If your restriction is only to not use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
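
For what it's worth, a rough sketch of what that could look like with Derby's embedded JDBC driver (the database name, table layout and column size are assumptions; derby.jar just has to be on the classpath, no server or installation needed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DerbySortSketch {
    public static void main(String[] args) throws SQLException {
        // Embedded Derby: the database is just a local directory, no external server.
        try (Connection con = DriverManager.getConnection("jdbc:derby:sortdb;create=true")) {
            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE lines(line VARCHAR(5000))");
            }
            // ... insert the file's lines here with a PreparedStatement, then
            // let the database hand them back in sorted order:
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT line FROM lines ORDER BY line")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}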

FRotthowe
  • 1
    Any solution you find that does not put strain on your VM's heap space has to be based on some notion of intermediate file storage. So basically you start to implement your own database. Thus you might just use an existing one that is known to work. – VoidPointer Jan 18 '10 at 16:59
1

Here is a way to do it without heavy use of sorting inside Java and without using a DB. Assumptions: you have 1 TB of disk space and the files contain, or start with, a unique number, but are unsorted.

Divide the files N times.

Read those N files one by one, and create one file for each line/number

Name each file with the corresponding number. While naming, keep a counter updated to track the smallest number.

Now the root folder already holds files that can be listed sorted by name; you can pause your program to give yourself time to fire a command on your OS to sort the files by name, or you can do it programmatically.

Now that you have a folder with files sorted by name, use the counter to take each file one by one, append its number to your OUTPUT file, and close it.

When you are done you will have a large file with sorted numbers.

MayurRB
0

You can use an SQLite file DB: load the data into the DB, then let it sort and return the results for you. Advantages: no need to worry about writing the best sorting algorithm. Disadvantages: you will need disk space, and processing will be slower. https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files

user2071703
0

I know you mentioned not using a database no matter how light... so maybe this is not an option. But what about hsqldb in memory: load the data, sort it with a query, purge it. Just a thought.

PaulP1975
  • I write programs which get deployed in the production server. That server is handled by some other team in some other countries. I don't have the direct access to the server! – Erika Gomez Jan 18 '10 at 16:52
  • You don't need access to the server... try using the embedded option. I used an embedded hsqldb to map database IDs when migrating data from 1 db to another and I couldn't maintain my IDs from the original. It worked really well... the performance was surprisingly good. – PaulP1975 Jan 18 '10 at 17:11
  • but if you use the embedded database only in-memory, the data will still need to fit into the available memory. I'm sure hsqldb can work with a temporary storage file so that would still work. Just wanted to point out that running purely in memory cannot be an option. – VoidPointer Jan 31 '10 at 21:13
0

You can do it with only two temp files - source and destination - and as little memory as you want. On first step your source is the original file, on last step the destination is the result file.

On each iteration:

  • read a chunk of data half the size of the buffer from the source file into the sliding buffer;
  • sort the whole buffer;
  • write the first half of the buffer to the destination file;
  • shift the second half of the buffer to the beginning and repeat.

Keep a boolean flag that says whether you had to move some records in the current iteration. If the flag stays false, your file is sorted. If it's raised, repeat the process using the destination file as the source.

Max number of iterations: (file size)/(buffer size)*2

  • If I understand this algorithm correctly, it's some sort of a bubble sort of the sorted group. Suppose the datasize is `P*K` with K as the buffer size, then the complexity is roughly `P^2 K log K`, quadratic in P. If P is going to be big, you're better off with merge sort strategy shown in the other answers (like the one by KLE), even if you have to compromise some locality (how good is the disk cache anyway?). – syockit Mar 23 '21 at 02:08
0

You could download GNU sort for Windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm Even if that uses too much memory, it can merge smaller sorted files as well; it automatically uses temp files.

There's also the sort that comes with Windows, in cmd.exe. Both of these commands can specify the character column to sort by.

js2010
0

File sort software for big files: https://github.com/lianzhoutw/filesort/ . It is based on a file merge sort algorithm.

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/30844079) – Procrastinator Jan 21 '22 at 07:22
0

Try using bigsort; it supports sorting very big files/streams, both ascending and descending. It also supports other operations such as shuffle and uniq. You can install it using the command

pip install bigsort

and then sort with the following command

cat unsorted.txt | bigsort > sorted.txt

hugh
by he
-3

If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.

You will have to scan the lines in the file, and only need to keep 2 lines in memory at a time, swapping them if they are not in the right order. Repeat the process until there are no lines left to swap.

Jamal
user218447
  • Swapping lines in a file would require you to rewrite the entire file between the two lines you're swapping. That is, of course, unless you're sorting a file with fixed line lengths. – Mark Nov 21 '15 at 00:24