Our team uses Redis as a vector store for our LangChain application. We store text chunks as hashes under index-prefixed keys, with metadata, content, and vector fields. The issue arises when we try to identify and remove duplicates from the vector store. Our current method retrieves all the keys into Python, parses each hash's metadata field, and filters on the "source" item to find all the keys belonging to a specific document. We then go back to Redis and delete the found keys.
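A minimal sketch of that round trip, using a plain Python dict to stand in for the stored hashes (in the real script the loop runs over keys from `r.scan_iter()` and reads each hash's metadata field; the key names and `source` values here are illustrative):

```python
import json

# Stand-in for the stored hashes: key -> fields, metadata serialized as JSON.
store = {
    "test:chunk1": {"metadata": json.dumps({"source": "a", "title": "b"})},
    "test:chunk2": {"metadata": json.dumps({"source": "x", "title": "y"})},
    "test:chunk3": {"metadata": json.dumps({"source": "a", "title": "b"})},
}

def keys_for_source(store, source):
    """Pull every key, parse its metadata field, and keep the matches."""
    return [key for key, fields in store.items()
            if json.loads(fields["metadata"]).get("source") == source]

# Round trip: filter in Python, then delete back in Redis (here: the dict).
matches = keys_for_source(store, "a")
for key in matches:
    del store[key]  # in production: r.delete(*matches)
```

This works, but every hash has to travel to Python just to check one metadata item, which is the step we would like to push down into Redis.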
Is there a way to achieve this natively in Redis, without the Python intermediate step?
We seek advice from experienced Redis users for efficient alternatives.
Here is a working example of our insertion script:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.redis import Redis
from langchain.embeddings import OpenAIEmbeddings
## Load docs
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=30)
docs = text_splitter.split_documents(documents)[:5]
## Embeddings
embeddings = OpenAIEmbeddings()
## Redis
indx = "test"
url = "redis://localhost:6379/0"
for document in docs:
    content = document.page_content
    metadata = document.metadata
    metadata["source"] = "a"
    metadata["title"] = "b"
    metadata["tags"] = "c"
    document.metadata = metadata
rds = Redis.from_documents(
    documents=docs + docs,  # intentionally duplicated to create duplicates
    embedding=embeddings,
    redis_url=url,
    index_name=indx,
)
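As a stopgap on the Python side (not the native-Redis answer we are after), we also considered fingerprinting chunks before insertion so exact duplicates never reach the store. A sketch, using plain `(content, source)` tuples in place of LangChain `Document` objects:

```python
import hashlib

def dedupe_chunks(chunks):
    """Keep only the first occurrence of each (source, content) pair."""
    seen, unique = set(), []
    for content, source in chunks:
        # Hash source and content together so identical text from
        # different sources is not treated as a duplicate.
        fingerprint = hashlib.sha256(f"{source}\x00{content}".encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((content, source))
    return unique
```

Feeding `docs + docs` through this before `Redis.from_documents` would yield each chunk exactly once, but it only prevents new duplicates; it does not clean up what is already stored.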
Simplified illustrative example:
Imagine a library where books are stored in Redis as chunks of text. Each book is divided into sections, and each section is stored as a Redis hash with the following structure:
Hash Key: indexname:hash_id
Fields:
- metadata: JSON containing book details (title, author, tags)
- content: Text content of the section
- vector: Vector representation for semantic search
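For concreteness, one stored section might look like this (all values illustrative; in Redis the vector field holds the embedding as raw bytes rather than a Python list):

```python
import json

# One section of one book, mirroring the hash layout described above.
hash_key = "library:2f9c1a"  # index name + hash id (illustrative)
fields = {
    "metadata": json.dumps(
        {"title": "Moby-Dick", "author": "Melville", "tags": "novel"}
    ),
    "content": "Call me Ishmael. Some years ago...",
    "vector": [0.013, -0.208, 0.441],  # truncated embedding for illustration
}
```

Note that the book title lives inside the serialized metadata JSON, which is why matching on it currently requires pulling the hash into Python first.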
Currently, when we need to update or remove a specific book, we first retrieve all keys matching the index prefix, check each one's metadata for the book title, and then iterate through the matching keys to perform the necessary actions. As the number of books grows, this approach becomes less efficient. We are seeking advice on how to manage Redis storage and retrieval more effectively as our library scales.