
I have 200 folders with up to 20 files in each folder. In total the dataset is 2 GB. I tried parsing everything at once, putting each line into a list and sorting it, but I get an out-of-memory error.

What approach could I use to sort the multiple files into one single file?

sweep
  • Simplest solution is: add more memory to your heap size. 8 GB isn't much. My 9-year-old has an old machine of mine which has 24 GB. – Peter Lawrey Feb 06 '17 at 16:24
  • XML? JSON? Plain text? – bichito Feb 06 '17 at 16:24
  • It's plain text – sweep Feb 06 '17 at 16:25
  • The point here is that there is not enough info. I would provide more detail. – bichito Feb 06 '17 at 16:27
  • I tried parsing all files at once, putting everything into an ArrayList of objects containing the values in each line. Then I tried sorting using Collections. At a smaller size it works, but it crashes if I try to do it with all files and folders at once. – sweep Feb 06 '17 at 16:31
  • Then I would buy some memory, because we don't know which kind of algorithm you are using to crash it. We don't know the memory size or the JVM arguments. Have you tried -Xmx to increase the heap size? – bichito Feb 06 '17 at 16:35
  • I used Collections to sort it by a line's date – sweep Feb 06 '17 at 16:40
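
(As an aside on the heap-size suggestion in the comments above: the maximum heap is set with the JVM's -Xmx flag. The 4 GB figure and the class name below are placeholders, not details from the question.)

```
java -Xmx4g YourSorter
```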

2 Answers


File-based merge-sort:

  1. Sort content of each file.
  2. Merge sort the 20 files of each folder to get one sorted file per folder.
  3. Merge sort the 200 folder-files to get final result.

If you don't want to do a 200-way merge sort, you can split #3 into multiple merge-sorts and then merge-sort the results of those, to as many levels as needed.
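
A minimal sketch of step 1, assuming plain-text files under a `data` root directory and natural (lexicographic) line order; swap in a date comparator if that is your sort key:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class SortEachFile {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("data");  // assumed root holding the 200 folders
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            // a single file is small enough to sort in memory
            List<String> lines = Files.readAllLines(file);
            lines.sort(Comparator.naturalOrder());  // replace with your date comparator
            Files.write(file.resolveSibling(file.getFileName() + ".sorted"), lines);
        }
    }
}
```

Steps 2 and 3 are then plain merges of the already-sorted files (see the sketch after the comments below).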

Andreas
  • Note that steps 2 and 3 aren't merge sorts, but just merges. Specifically, [k-way merge](https://en.wikipedia.org/wiki/K-Way_Merge_Algorithms). And, really, if your code can do a 20-way merge then it can do a 200-way merge. At some point you'll run out of memory for the buffers, but not at 200. – Jim Mischel Feb 06 '17 at 17:22
  • @JimMischel I don't believe there is a requirement for merge-sort to be limited to 2-way merges, although that is typically the case. The concept of merging two (or more) sorted subsets to produce a larger sorted subset, and repeating that process until all the data is sorted, is exactly what Merge-Sort does. I never said that step 2 is a merge sort. I was saying that the entire sequence (1-3) is a disk-based multi-way merge-sort. Sure, step 1 might be doable in-memory, and use whatever sort algorithm, but the overall concept is a merge-sort. – Andreas Feb 06 '17 at 19:03
  • You misunderstood my comment. I was simply saying that steps 2 and 3 are not technically sorts, but rather just merges. So rather than "Merge sort the 20 files ..." it's more correct to say "Merge the 20 files ..." As for the merges, I never mentioned doing 2-way merges. That would be horribly inefficient. The most efficient in this case would be to do a 4000-way merge after sorting the individual files. That would minimize the I/O time. But splitting it up as you recommended makes practical sense if speed isn't the highest priority. – Jim Mischel Feb 06 '17 at 20:02
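
To illustrate the k-way merge discussed in these comments, here is a rough sketch that keeps one buffered reader per sorted input file in a priority queue, so only one line per file is held in memory at a time (the class and method names are made up for the example):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class KWayMerge {
    // one open reader plus the line it is currently offering
    static class Source {
        final BufferedReader reader;
        String line;
        Source(Path p) throws IOException {
            reader = Files.newBufferedReader(p);
            line = reader.readLine();
        }
    }

    public static void merge(List<Path> sortedInputs, Path output) throws IOException {
        // smallest current line first; swap in a date comparator if needed
        PriorityQueue<Source> queue =
                new PriorityQueue<>(Comparator.comparing((Source s) -> s.line));
        for (Path p : sortedInputs) {
            Source s = new Source(p);
            if (s.line != null) queue.add(s);
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!queue.isEmpty()) {
                Source s = queue.poll();           // source holding the smallest line
                out.write(s.line);
                out.newLine();
                s.line = s.reader.readLine();      // advance that source
                if (s.line != null) queue.add(s);  // put it back if lines remain
                else s.reader.close();
            }
        }
    }
}
```

Whether this is called with the 20 files of one folder, the 200 per-folder files, or all of the sorted files at once only changes the queue size; memory use stays proportional to the number of open inputs.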

What sorting algorithm do you use? I think the problem lies with the algorithm; you need to look for a more efficient algorithm to do the sorting. I believe that for large inputs, Merge-Sort is the best (albeit with a few modifications for that size).

Here is a very similar question; take a look at the top two answers. They should help you solve the problem.

Arjen