0

This question is a continuation of Sorting 20GB of data.

Nobody mentioned TDictionary in the answers to the previous question. Why? Will TDictionary choke with such large data?

This time the input data is different. One record is like this:

1 abc
2 00000000
3 00000000
4 00000000

In my record, the first row contains the data that has to be sorted. Rows 2, 3, 4 are not relevant for sorting. The first row could be anything between 10-2000 bytes (chars). Total number of entries could be in the tens of million range.

I am thinking in putting the first row into the Dictionary and the address in file (offset) where the record starts.

Once the dictionary is sorted, I only have to go to the offset, read the record and copy it into the new (sorted) file.

So, the question is: Is TDictionary suitable for such a large number of entries?

Community
  • 1
  • 1
Gabriel
  • 20,797
  • 27
  • 159
  • 293
  • 5
    `TDictionary` cannot be sorted. It is internally sorted by hash of the key, so it is very fast for lookups *by key*. ie : if you know you want record with key `foobar` then you can find it. If you want the nth entry, however, there is no such concept. You really need to take the previous advice and use a database to store this information. – J... Nov 17 '15 at 14:23
  • 1
    It is possible, however, to get a list of the dictionary keys, and sort them. Then look up the key in the dictionary to get the value, and use that value to copy the corresponding record. – Jim Mischel Nov 17 '15 at 14:25
  • "Tens of millions" is not a large number. `TDictionary` will support that many entries. The only potential problem will be holding tens of millions of keys that are hundreds of bytes long. – Jim Mischel Nov 17 '15 at 14:26
  • @JimMischel-Good idea: in this case, I could simply use a record. Faster, less memory... – Gabriel Nov 17 '15 at 14:26
  • 1
    @JimMischel You could do that, but sorting that many strings in a list is going to be a nightmare - doubly so if you have to do it each time you run the program. Far better to just get this data into a database and then be done with it, imo. – J... Nov 17 '15 at 14:39
  • Difficult to understand how a dictionary would help. Mergesort is the most common form of external sort. Why are you rejecting it? – David Heffernan Nov 17 '15 at 18:03
  • @DavidHeffernan-Nothing against megasort. Mergesort works by splitting the FILE in pieces. I think that I could obtain huge performance gain by loading (the relevant parts of) the file in memory. – Gabriel Nov 17 '15 at 19:24

1 Answers1

2

The reason that a dictionary was not mentioned is that it is an unordered container. Dictionaries, by way of being unordered, cannot be sorted. If you wished to order data that was held in a dictionary you would need a different container, at which point what would be the point of the dictionary?

In your previous question mergesort was recommended. This is good advice. It is ideal for external sorting. Which seems to be what you need.

David Heffernan
  • 601,492
  • 42
  • 1,072
  • 1,490