3

Goal:

I would like to implement a function which has a sequence of inputs "X1,...,Xn" and outputs an ordered list "Xp,..,Xq" where all the elements are distinct but ordered.

Requirement:

  • For every Xi in the sequence "X1,...,Xn", it is a 256 bits long string.
  • The sequence of inputs "X1,...,Xn" may have the same elements, which means that there may exist two elements Xi and Xj to satisfy Xi=Xj.
  • For the same elements in sequence of "X1,...,Xn", we only output one element in the ordered list.
  • The speed of function should be as fast as possible. And it does not matter that how much storage volume is used in function.
  • The size of sequence "X1,...,Xn" is n, and n is a number that no more than 10,000.

My idea:

  • I use an Array to store the sequence which is initially empty.

  • When inputting Xi, first I search the Hashtable to judge if Xi is already in the above array. If yes, just dropping it. And if not, add Xi to Hashtable and Array.

  • If having inputting all the element of the sequence "X1,...,Xn", I sort the array and output it.

Question:

  • And with Merkle Patricia Tree (Ethereum) and Hashtable, which one should I choose?
  • For Merkle Patricia Tree (Ethereum) and Hashtable, which one has faster search speed?
  • Or is there a better data structure to satisfy this function?
steveluscher
  • 4,144
  • 24
  • 42
Xing Chang
  • 61
  • 6
  • 1
    You might consider just implementing it with Hashtable to see if it's fast enough. If you need it to be faster, then experiment with other implementations. – Jim Mischel Apr 03 '18 at 17:21
  • If Balanced Binary Tree is also the consideration, is it faster than Hashtable? Whether the cost of Balanced Binary Tree is higher than that of Hashtable? – Xing Chang Apr 04 '18 at 04:21
  • There's no way to answer this without implementing both and trying because the performance difference is likely to be fairly small, and details of the environment, data, and implementation will determine the winner. E.g. if the strings will have high entropy, you can make the hash table faster by sampling only a subset of bits to compute hash values. The hash table will also be smaller than any radix tree, so memory hierarchy effects will be in its favor. – Gene Sep 03 '22 at 02:51

1 Answers1

1

If you want the fastest look up nothing can beat the hashtable but hashtable is not good for ordering. merkle patricia trie allows us to verify data integrity in large datasets. "Patricia" stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric". Since blockchains include sensitive financial transactions, "merkle trees" are heavily used in blockchains. In your question you are worried about data integrity because inputs are in order and each input in sequence might be same or might contain similar elements. That sounds like perfect use case for merkle-patricia tree

Yilmaz
  • 35,338
  • 10
  • 157
  • 202