Semantic video search

Question

I am trying to extend OpenAI's CLIP to semantic video search: given a text query, I want to retrieve the video segments/clips whose content matches the query semantically. Here's what I have in mind so far:

Extract frames from the video at regular intervals.
Use CLIP to create embeddings of these frames and the text query.
Compare the text query embeddings with the video frame embeddings to find matches.

However, this approach seems quite naive, and I feel it might not effectively capture the context in the videos due to the temporal information being lost.

Can anyone share advice on improving this approach? Is there a more efficient or effective way to implement semantic video search with OpenAI's CLIP? Also, I'm wondering about any preprocessing steps, possible optimization strategies, or libraries that could be beneficial for this task.

Any help or guidance would be greatly appreciated. Thanks!

danywigglebutt · Answer 1 · 2023-07-02T17:39:49.690

Here's a simplified step-by-step approach:

Chunk the Video into 1-second Intervals

To divide the video into 1-second chunks, you would typically use a library like moviepy or opencv.

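import cv2

video = cv2.VideoCapture('your_video.mp4')
if not video.isOpened():
    raise IOError('Could not open your_video.mp4')

# Frames per second; fall back to 1 if the container does not report it
fps = max(1, round(video.get(cv2.CAP_PROP_FPS)))
frames = []

# Read every frame into memory (fine for short clips; sample frames for long videos)
while True:
    ret, frame = video.read()
    if not ret:
        break
    frames.append(frame)

video.release()

# Now chunk into 1-second intervals
chunks = [frames[i:i + fps] for i in range(0, len(frames), fps)]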
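
Reading every frame into memory gets expensive for long videos, and neighbouring frames are nearly identical anyway. Here is a minimal sketch of picking a few evenly spaced frame indices per 1-second chunk, assuming you already know the frame count and FPS. The helper name `sample_frame_indices` and the `per_chunk` parameter are my own, not part of OpenCV:

```python
def sample_frame_indices(total_frames, fps, per_chunk=3):
    """Return, for each 1-second chunk, up to `per_chunk` evenly spaced frame indices."""
    frames_per_chunk = max(1, round(fps))
    sampled = []
    for start in range(0, total_frames, frames_per_chunk):
        end = min(start + frames_per_chunk, total_frames)
        length = end - start
        count = min(per_chunk, length)
        # Pick the midpoint of each of `count` equal slices (integer arithmetic only)
        sampled.append([start + (2 * i + 1) * length // (2 * count) for i in range(count)])
    return sampled
```

Each index can then be fetched with `video.set(cv2.CAP_PROP_POS_FRAMES, idx)` followed by `video.read()`, with the total taken from `cv2.CAP_PROP_FRAME_COUNT`. Both seeking accuracy and the reported frame count can be approximate for some codecs, so treat this as a starting point.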

Generating the Embeddings

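Each 1-second chunk is a list of frames. Every frame is preprocessed and embedded with the OpenAI CLIP model, and the frame embeddings are then pooled into one vector per chunk.

import cv2
import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

chunk_embeddings = []
for chunk in chunks:
    # preprocess expects RGB PIL images; OpenCV returns BGR NumPy arrays
    images = [preprocess(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))) for frame in chunk]

    # Stack all tensors into one batch
    images_input = torch.stack(images).to(device)

    # Generate one embedding per frame
    with torch.no_grad():
        image_features = model.encode_image(images_input).float()

    # Normalize, then mean-pool the frames into a single vector per chunk (one simple choice)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    chunk_embeddings.append(image_features.mean(dim=0))

# Shape: (number of chunks, embedding dimension)
corpus_vector = torch.stack(chunk_embeddings)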

Performing the Search

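You can use cosine similarity between the text query embedding and the chunk embeddings:

from sentence_transformers import util

# Assumes `model`, `clip` and `torch` from above; `corpus_vector` holds one embedding per chunk
device = next(model.parameters()).device
with torch.no_grad():
    query_vector = model.encode_text(clip.tokenize(['people experiencing joy']).to(device)).float()

# Label each chunk by its start time in seconds
docs = [f'{i}s' for i in range(len(corpus_vector))]

# Calculate cosine similarity between the corpus of vectors and the query vector
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)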
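
Since each chunk covers one second, a chunk index maps directly to a start time, and neighbouring hits can be merged into longer clips. Here is a rough sketch in plain Python, assuming `scores[i]` is the similarity for the chunk that starts at second `i`. The function name and the `threshold` default are illustrative assumptions; a useful threshold depends on the model and your data:

```python
def scores_to_segments(scores, threshold=0.25, chunk_seconds=1.0):
    """Merge consecutive chunks scoring >= threshold into (start, end, best_score) segments."""
    segments = []
    current = None  # [start_index, end_index_exclusive, best_score]
    for i, score in enumerate(scores):
        if score >= threshold:
            if current is None:
                current = [i, i + 1, score]
            else:
                current[1] = i + 1
                current[2] = max(current[2], score)
        elif current is not None:
            segments.append(current)
            current = None
    if current is not None:
        segments.append(current)
    # Convert chunk indices to seconds and rank the best segment first
    result = [(s * chunk_seconds, e * chunk_seconds, best) for s, e, best in segments]
    return sorted(result, key=lambda seg: seg[2], reverse=True)
```

This returns clips such as "from 1.0 s to 3.0 s" rather than isolated one-second hits, which is usually closer to what a user wants from a video search.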

The challenge with this approach, however, is that treating each 1-second interval as a set of independent frames does not capture the temporal context of the video. The frames should be treated as moving images.
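
One cheap way to bring some temporal context back, short of a dedicated video model, is to blend each chunk's embedding with its neighbours before searching. Here is a minimal sketch on plain Python lists. The function name and the `radius` parameter are my own, and this is only a heuristic, not how any particular product does it:

```python
def smooth_embeddings(embeddings, radius=1):
    """Average each chunk embedding with up to `radius` neighbours on each side."""
    smoothed = []
    n = len(embeddings)
    for i in range(n):
        # Clamp the window at the start and end of the video
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = embeddings[lo:hi]
        dim = len(embeddings[i])
        smoothed.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return smoothed
```

A larger `radius` makes matches more stable across a scene but blurs sharp cuts, so it is worth trying a few values against queries you care about.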

Mixpeek offers a managed search API that does this:

GET: https://api.mixpeek.com/v1/search?q=people+experiencing+joy

Response:

[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {
      "file_ext": "mp4",
      "file_id": "ebc289d7-44e1-4672-bf3c-ccfa490b7k2d",
      "file_url": "https://mixpeek.s3.amazonaws.com/<user>/<file>.mp4",
      "filename": "CR-9146f0.mp4"
    },
    "score": 0.636489987373352,
    "timestamps": [
      2.5035398230088495,
      1.2517699115044247,
      3.755309734513274
    ]
  }
]

Further reading and demo: https://learn.mixpeek.com/what-is-semantic-video-search/
