
Given a pointer to a struct, I need to determine very quickly whether it is part of a set (which I have to define/implement myself). I am considering a technique like a Bloom filter, but don't really know how to apply it to a pointer.

The solution needs to work on both 32- and 64-bit machines.

Edit: All those pointers (2k-5k of them) point to various random memory addresses, since they target elements of a doubly linked list I have no control over. This can be rephrased as: "How do I find whether an element is part of a read-only doubly linked list by creating another structure aside from it?"

Edit 2: The doubly linked list may grow over time, but not the set I am controlling.

Patrick Allaert
  • Explain "part of a set"? What set? Do you mean, find out if the pointer to a struct is pointing to an element of a given array of those structs? – M.M Jun 17 '14 at 08:17
  • I will update my question to be more precise; in the meantime: all those structs (2k-5k of them) appear in a "kind of" doubly linked list I have **zero** control over, however, I need to create that *set* myself, aside from the existing *DLL* – Patrick Allaert Jun 17 '14 at 08:28
  • Bloom filters are probabilistic... are you okay with that? – Dietrich Epp Jun 17 '14 at 08:48
  • You want something faster than traversing the list checking whether each node matches the pointer? If so , you could also store every candidate pointer in a data structure that is quick to search (e.g. a search tree). Of course this means you have to maintain the search tree every time that you make a modification to the linked list (or rebuild it before using it of course). – M.M Jun 17 '14 at 08:59
  • Answering the part "don't know how to do it on a pointer": cast to `(uintptr_t)` (this cast is guaranteed to be a bijection) and use whatever technique you would use on an integer – M.M Jun 17 '14 at 09:02
  • @DietrichEpp Yes, I'm OK with that; I will have *many* more false results than true ones. Hence I will fall back on something slower, but not probabilistic, in the case it is true. – Patrick Allaert Jun 17 '14 at 09:50
  • @MattMcNabb Yes, I want something faster than traversing the whole list. A search tree is maybe what I need. The doubly linked list may grow, but the *set* will not. Regarding the cast to `uintptr_t`, that might be a very good start! Now I have to figure out what technique (probabilistic or not) to use on an integer. – Patrick Allaert Jun 17 '14 at 09:57
  • It is really not clear what you are asking. Headline "how to find a pointer in a set". Then first you say that you have a pointer that may be part of a set. Fine. But then you say that these pointers target double linked-list items, which is not related to the set. And then you suddenly ask how to find an item in the double-linked list. – Lundin Jun 17 '14 at 10:18
  • @Lundin: Let's have a linked list of 26 items: `a <-> b <-> c <-> ... <-> z`; each of these items has an address in memory (pointer): `0x0000040: a, 0x0000060: b, 0x0001010: c, ... 0x00a0b50: z`. Now I have to implement a set of pointers, e.g.: `{0x0000040, 0x0001010, 0x00a0b50}` (corresponding to: a, c and z) and need to very quickly figure out whether an address is part of this set, e.g. is `0x00a0b50` part of it? => yes (z), is `0xffa0b50` part of it? => no – Patrick Allaert Jun 17 '14 at 12:54
  • Can you say anything about the range of addresses of these pointers? How large is this range? – meaning-matters Jun 17 '14 at 13:03
  • @meaning-matters: unfortunately I don't know, however, I can compute it since the set will be fixed from the beginning. – Patrick Allaert Jun 17 '14 at 13:41
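M.M's suggestion in the comments above can be sketched as follows. This is a minimal example of the `uintptr_t` cast that turns a pointer into an integer key; the `struct node` type here is hypothetical, standing in for the real list element:

```c
#include <stdint.h>

/* Hypothetical element of the read-only doubly linked list. */
struct node {
    struct node *prev, *next;
    int value;
};

/* Turn a pointer into an integer key.  Where uintptr_t exists, the
 * round trip pointer -> uintptr_t -> pointer is exact, on both 32-
 * and 64-bit targets, so any integer-based set technique (hash table,
 * Bloom filter, ...) can then be applied to the result. */
static inline uintptr_t ptr_key(const struct node *p)
{
    return (uintptr_t)p;
}
```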

1 Answer


From the page you referenced:

Bloom proposed the technique for applications where the amount of source data would require an impracticably large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns.

If you only have 2k-5k items then I doubt you're talking about an "impracticably large hash area in memory".

In short, I'd recommend a hash table. You'd need to do one O(n) pass over the data to build the table, and then lookups would be O(1).
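A minimal sketch of such a pointer hash set, using open addressing with linear probing (the table size and shift amount are my own assumptions, not from the question, and the table is assumed to stay well under full since the set is fixed at 2k-5k entries):

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16384          /* power of two, well above 5k entries */

static void *table[TABLE_SIZE];   /* NULL marks an empty slot */

/* Hash: drop the low bits (zero for aligned allocations), then
 * mask into the power-of-two table. */
static size_t slot_for(const void *p)
{
    uintptr_t k = (uintptr_t)p;
    return (size_t)((k >> 3) & (TABLE_SIZE - 1));
}

/* O(1) expected insert; probes linearly past collisions. */
static void set_add(void *p)
{
    size_t i = slot_for(p);
    while (table[i] != NULL && table[i] != p)
        i = (i + 1) & (TABLE_SIZE - 1);
    table[i] = p;
}

/* O(1) expected membership test. */
static int set_contains(const void *p)
{
    size_t i = slot_for(p);
    while (table[i] != NULL) {
        if (table[i] == p)
            return 1;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

Build the table once from the fixed set, then each lookup is a shift, a mask, and (usually) a single array probe.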

ams
  • Thanks @ams for your answer. Unfortunately I am a bit scared that this solution might be "slow". Not that *O(1)* is not fast, but rather that I have to perform this operation so many times (about 50k-400k times) per "run", and one "run" could certainly not take more than 0.1 ms. I therefore need something that can be done in less than 1e-10 second. – Patrick Allaert Jun 17 '14 at 13:07
  • My current concern is that both hash tables and Bloom filters require hashing, and even if this is performed only once (sometimes more with Bloom filters (k)), it might be **the** bottleneck. I should have been clearer from the beginning about the performance aspect, sorry. – Patrick Allaert Jun 17 '14 at 13:12
  • @PatrickAllaert Try to figure out if what you want is theoretically possible at all. – meaning-matters Jun 17 '14 at 13:15
  • @PatrickAllaert A suitable hash function for a single integer (pointer) is not likely to be very slow. You could probably just shift off the bottom bits (which are always zero, because malloc returns aligned memory) and zero most of the top bits, and that'll be good enough. That'll give you some power-of-two table size. If the collision rate turns out to be too high, then you'd need to do something smarter/slower, but it's worth a try. In any case, a little in-register jiggery-pokery with numbers is surely faster than traversing an in-memory search tree. – ams Jun 17 '14 at 14:24
  • @PatrickAllaert Incidentally, 0.1ms on a 2GHz processor is only 200k CPU cycles. Doing *anything* 400k times in that many cycles is going to be challenging! – ams Jun 20 '14 at 09:05
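The shift-and-mask hash ams describes in the comments could look like this as a standalone function. The shift of 3 assumes at least 8-byte allocation alignment, which is typical but worth verifying on your platform, and the table-size constant is an illustrative choice:

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_SHIFT 3   /* assumes >= 8-byte alignment of allocations */
#define TABLE_BITS  14  /* 2^14 = 16384 slots */

/* Shift off the low bits that are always zero for aligned allocations,
 * then keep only enough bits to index a power-of-two table.
 * Costs one shift and one mask: a handful of cycles per lookup. */
static inline size_t ptr_hash(const void *p)
{
    uintptr_t k = (uintptr_t)p;
    return (size_t)((k >> ALIGN_SHIFT) & ((1u << TABLE_BITS) - 1));
}
```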