0

I have an unlimited number of 12 byte messages arriving. The content can be treated as random and structureless. (The length is only important because it is shorter than most hashes.)

I want to deduplicate them.

One method is to store the last 1,000 messages in a circular buffer and check all 1,000 message for a match before accepting a message (and inserting it into the circular buffer for future checks).

What other methods are there, which could be more CPU and memory efficient?

fadedbee
  • 42,671
  • 44
  • 178
  • 308
  • You want sequence of 12 bytes ie your message unique? – Pras Apr 03 '18 at 11:13
  • Yes, I only want to accept 12 byte messages which have not been seen recently (within the last 1,000 messages or more). – fadedbee Apr 03 '18 at 11:14
  • You can use crypto hashes, but they could be cpu intensive, you can also use hash bucket, based on hash function, in each bucket you will have less number of messages to compare – Pras Apr 03 '18 at 11:18
  • @Pras no need for hashing, the messages are shorter than good hashes. – fadedbee Apr 03 '18 at 11:20
  • *which could be more CPU and memory efficient?* -- I don't think you can have both. Buckets could speed up the search, but will of course need more memory than a simple circular buffer. –  Apr 03 '18 at 11:29
  • Take a minute to read _[the introduction](http://igm.univ-mlv.fr/~lecroq/string/)_ here. It may turn out that one of the algorithms featured may help. The implementations are in C. – ryyker Apr 03 '18 at 12:52

1 Answers1

1

12 Bytes seem quite small length. You can cast your byte array to string and then use a tree structure based on strings, via exploiting strcmp().

  1. Way to cast byte array to string

  2. Tree structure based on strings

Unless you form a skewed tree, O(logn) would be your worst case for deduplication. In such case, it is not hard to change to a self-balancing tree too.

Here my BST implementation using string type keys:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


struct Node
{
  char *key;
  char *value;
  struct Node *left;
  struct Node *right;
};

struct Node* newNode(char *strKey,char *strValue)
{
  struct Node *tmp = (struct Node*) malloc(sizeof(struct Node));
  tmp->key = strdup(strKey);
  tmp->value = strdup(strValue);
  tmp->left = NULL;
  tmp->right = NULL;
  return tmp;
}

struct Node* insert(struct Node* node, char *newKey, char *newValue)
{
  if (node == NULL)
    return newNode(newKey,newValue);

  int comparison = strcmp(newKey,node->key); 
  if (comparison < 0)
    node->left  = insert(node->left, newKey, newValue);
  else if (comparison > 0)
    node->right = insert(node->right, newKey, newValue);  
  else
  {
    printf("Error occured while insert to BST\n");
    return NULL;
  }

  return node;
}

struct Node* deleteNode(struct Node* node, char *key2del)
{
  if (node == NULL)
    return node;

  int comparison = strcmp(key2del,node->key);
  if (comparison < 0)
    node->left = deleteNode(node->left, key2del);
  else if (comparison > 0)
    node->right = deleteNode(node->right, key2del);
  else // where deletion occurs
  {
    if (node->left == NULL)
    {
      struct Node *tmp = node->right;
      free(node);
      return tmp;
    }
    else if (node->right == NULL)
    {
      struct Node *tmp = node->left;
      free(node);
      return tmp;
    }

    struct Node *tmp = node->right;
    while(tmp->left != NULL)
      tmp = tmp->left;

    node->key = tmp->key;
    node->right = deleteNode(node->right, tmp->key);
  }

  return node;
}
CanCode
  • 154
  • 2
  • 10
  • Is this meant to work in conjunction with a buffer of the last 1,000 messages, so that the oldest can be removed when a new, unique, one is added? – fadedbee Apr 03 '18 at 13:09
  • I didnt put such limit to data structure but you can easily do that while calling insert method. Basically just limit number of nodes by 1000 and you get what you want :) – CanCode Apr 03 '18 at 15:39
  • How do you know which node to remove (the oldest) when you insert a new one, unless you maintain a separate list of all the nodes in insert order? – fadedbee Apr 03 '18 at 15:47
  • To me, best way is to keep the order in buffer(if applicable). Otherwise BST or standard self-balancing trees, like AVL or Red-Black, will not provide such feature. Hence, either keep a separate list or keep the order in the buffer. – CanCode Apr 03 '18 at 17:20
  • Thanks for your answer, but I am sceptical that maintaining a tree, in addition to maintaining the original circular buffer, would be faster than just testing the 1,000 values in the circular buffer to do a check. `malloc` is of the order of 300 CPU cycles by itself. For large message buffers, your solution would be much more efficient than a circular buffer alone. – fadedbee Apr 04 '18 at 08:41