Given your extremely large number of strings, your choice has to focus on several points:
1. Are your indexing structures going to fit in memory?
For hashtables the answer is clearly no. Access time will therefore be much slower than O(1), but each lookup still needs only one disk access (and the whole insertion process is O(N)).
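As a minimal sketch of that single-access lookup (the file name, bucket count, and fixed-size buckets are hypothetical, and a real table would also need collision/overflow handling):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical layout: one file holding N_BUCKETS fixed-size buckets.
constexpr std::uint64_t N_BUCKETS   = 1ULL << 34;  // enough pages for ~10^12 strings
constexpr std::size_t   BUCKET_SIZE = 4096;        // one disk page per bucket

// One hash -> one offset -> one pread: a lookup costs exactly one disk access.
ssize_t lookupPage(int fd, const std::string& key, char* page) {
    std::uint64_t bucket = std::hash<std::string>{}(key) % N_BUCKETS;
    return pread(fd, page, BUCKET_SIZE,
                 static_cast<off_t>(bucket * BUCKET_SIZE));
}

int main() {
    char page[BUCKET_SIZE];
    int fd = open("strings.hash", O_RDONLY);  // hypothetical index file
    if (fd >= 0) {
        lookupPage(fd, "some string", page);
        close(fd);
    }
}
```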
For the B-tree I made some theoretical calculations, assuming a B+tree (to save space in interior nodes) and that interior nodes are fully occupied. By this analysis it won't fit in memory:
- The usual disk page size is 4096 bytes. That is the size of one B-tree node.
- The average size of your strings is 70 bytes (the smaller, the better).
- A child node address takes 4 bytes.
- An interior node holds d keys and has d+1 child addresses:
**4096 B = 4*(d+1) + 70*d <=> d = (4096-4)/74 ≈ 55; rounding down to be conservative, d = 54**
Approximating, each internal node addresses 54 children and each leaf holds 53 strings:

| internal nodes in memory | leaf nodes on disk | strings mapped |
|---|---|---|
| 0 | 1 | 53 |
| 1 | 54 | ≈ 53² |
| 1 + 54 | 54² | ≈ 53³ |
| ... | ... | ... |
| 1 + 54 + ... + 54⁵ | 54⁶ | ≈ 53⁷ |
53⁷ ≈ 1.2·10^12 > 10^12 strings, but keeping the internal nodes in memory costs about 54⁵ * 4096 bytes ≈ 1.9 TB, which won't fit.
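If you want to replay the arithmetic, here is a minimal sketch using the same 4096/70/4 constants; it keeps the exact fan-out of 55 instead of the rounded 54, so the figures land slightly higher, with the same conclusion of roughly 2 TB of internal nodes:

```cpp
#include <cstdio>

int main() {
    const double PAGE = 4096, KEY = 70, PTR = 4, N = 1e12;
    int d = static_cast<int>((PAGE - PTR) / (KEY + PTR));  // keys per node: 55
    double internals = 0;  // nodes that must stay in memory
    double nodes = 1;      // nodes at the current level (start with the root)
    int levels = 0;
    while (nodes * d < N) {   // leaves at this level can't yet map N strings
        internals += nodes;   // so this whole level becomes internal nodes
        nodes *= d + 1;       // each internal node has d+1 children
        ++levels;
    }
    std::printf("internal levels: %d, leaves: %.3g\n", levels, nodes);
    std::printf("memory for internal nodes: %.2f TB\n",
                internals * PAGE / 1e12);
}
```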
If your strings are not uniformly distributed, you can exploit their common prefixes. That way an internal node can address more children, allowing you to save memory. BerkeleyDB has that option.
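To illustrate the idea with the generic suffix-truncation trick (a sketch of the concept, not BerkeleyDB's actual code): an internal node only needs enough bytes of a separator key to route the search, so separators can be much shorter than the keys themselves:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// Shortest string s with left < s <= right. Storing s instead of the full
// right-hand key shrinks separators, so more of them fit in one 4096 B node.
std::string shortestSeparator(const std::string& left, const std::string& right) {
    size_t i = 0;
    while (i < left.size() && i < right.size() && left[i] == right[i]) ++i;
    return right.substr(0, std::min(i + 1, right.size()));
}

int main() {
    // Unrelated keys separate on a single byte:
    std::cout << shortestSeparator("applesauce", "zebrafish") << '\n';  // "z"
    // Keys with a long common prefix need prefix+1 bytes; page-level prefix
    // sharing (storing the common prefix once per node) removes the rest.
    std::cout << shortestSeparator("http://example.com/a/1",
                                   "http://example.com/b/1") << '\n';
    // prints "http://example.com/b"
}
```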
2. What kind of access are you going to employ: a large or a small number of reads?
If you have a large number of reads, are they random or sequential?
If your access is sequential, you might still benefit from the B-tree: you will use cached nodes a lot (no disk access needed), and the leaves are sequentially linked (B+tree). This is also great for range queries (which I think is not your case). If your access is completely random, then the hashtable is faster, as it always needs exactly one disk access, while the B-tree needs one disk access per level stored on disk.
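As a toy sketch of why sequential access is cheap (only the leaf level is modeled here, and the 53-strings-per-page figure is just carried over from above):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Toy B+tree leaf level: leaves are sequentially linked, so a range scan
// descends the tree once and then just follows next pointers, touching
// each page a single time (the few hot internal nodes stay cached).
struct Leaf {
    std::vector<std::string> keys;  // ~53 strings per 4096 B page, as above
    Leaf* next = nullptr;
};

// Print every key below hi, starting from the leaf where the scan begins.
void scanFrom(const Leaf* leaf, const std::string& hi) {
    for (; leaf != nullptr; leaf = leaf->next)  // one sequential page per step
        for (const auto& k : leaf->keys) {
            if (k >= hi) return;
            std::cout << k << '\n';
        }
}

int main() {
    Leaf second{{"banana", "cherry"}, nullptr};
    Leaf first{{"apple", "apricot"}, &second};
    scanFrom(&first, "cherry");  // prints apple, apricot, banana
}
```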
If you are going to make a small number of accesses, the hashtable is preferable, since insertion will always be faster.
Since you know the total number of strings up front, you can pass it to the hashtable, so you won't lose time on bucket-scaling operations (which force all elements to be rehashed).
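For instance, C++'s std::unordered_map exposes exactly that kind of hint; the key/value types and the count here are just illustrative:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

int main() {
    const std::size_t N = 1'000'000;  // stand-in for your known string count
    std::unordered_map<std::string, std::uint64_t> index;
    index.reserve(N);  // allocate buckets for N elements up front
    for (std::size_t i = 0; i < N; ++i)
        index.emplace("str" + std::to_string(i), i);  // no rehash ever happens
}
```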
Note: I found something about your Ukkonen's suffix tree. Insertion is linear, and access is sequential as well. However, I only found it used on inputs of a few GB. Here are some references about suffix tree algorithms: [ref1], [ref2] and [ref3].
Hope this helps somehow...