Here's a simplified step-by-step:
Chunk the Video into 1-second Intervals
To divide the video into 1-second chunks, you would typically use a library like moviepy or opencv.
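import cv2

video = cv2.VideoCapture('your_video.mp4')
fps = video.get(cv2.CAP_PROP_FPS)

# Read every frame into memory (fine for short clips, heavy for long videos)
frames = []
while video.isOpened():
    ret, frame = video.read()
    if not ret:
        break
    frames.append(frame)
video.release()

# Now chunk into 1-second intervals
# fps can be fractional (e.g. 29.97), or 0 if the file could not be read
frames_per_chunk = max(1, round(fps))
chunks = [frames[i:i + frames_per_chunk] for i in range(0, len(frames), frames_per_chunk)]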
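Many videos have a fractional frame rate (29.97 fps, for example), so slicing by a whole number of frames slowly drifts away from true 1-second boundaries. A minimal sketch of one alternative is to assign each frame to a chunk by its timestamp. `chunk_by_second` is a hypothetical helper name, and the integers below are stand-ins for real frames.

```python
def chunk_by_second(frames, fps):
    """Group frames into 1-second chunks using each frame's timestamp.

    Frame i is shown at time i / fps, so it belongs to chunk int(i / fps).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    chunks = []
    for i, frame in enumerate(frames):
        second = int(i / fps)
        # Start a new chunk each time a whole-second boundary is crossed
        while len(chunks) <= second:
            chunks.append([])
        chunks[second].append(frame)
    return chunks


# Stand-in integers instead of image arrays, at 2.5 frames per second
demo_chunks = chunk_by_second(list(range(6)), fps=2.5)
```

With this grouping, chunk sizes may differ by one frame, but chunk `n` always covers the interval from `n` to `n + 1` seconds.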
Generating the Embeddings
For each 1-second chunk, the frames are treated as a series of images, and their embeddings are calculated using the OpenAI CLIP model.
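import cv2
import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

chunk_embeddings = []
for chunk in chunks:
    # preprocess expects RGB PIL images, while OpenCV frames are BGR NumPy arrays
    images = [preprocess(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))) for frame in chunk]
    # Stack all tensors together into one batch
    images_input = torch.stack(images).to(device)
    # Generate the embeddings (one per frame)
    with torch.no_grad():
        image_features = model.encode_image(images_input)
    # Keep the per-frame embeddings for this chunk
    chunk_embeddings.append(image_features)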
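The search step needs one vector per chunk, while CLIP returns one vector per frame. One simple option, shown here as a sketch rather than the only approach, is to average the frame embeddings and L2-normalize the result. `pool_chunk` is a hypothetical helper, and plain Python lists stand in for the tensors.

```python
import math


def pool_chunk(frame_embeddings):
    """Average per-frame embeddings into one unit-length chunk embedding."""
    if not frame_embeddings:
        raise ValueError("chunk has no frames")
    dim = len(frame_embeddings[0])
    count = len(frame_embeddings)
    mean = [sum(vec[d] for vec in frame_embeddings) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector unchanged to avoid dividing by zero
    return [x / norm for x in mean] if norm > 0 else mean


# Two toy 2-dimensional "frame embeddings" (real CLIP vectors are much longer)
chunk_vector = pool_chunk([[3.0, 0.0], [3.0, 8.0]])
```

Averaging discards the order of the frames, which is part of the limitation discussed later in this answer.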
Performing the Search
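Embed the text query with the same CLIP model, then rank the chunks by cosine similarity (util here comes from the sentence-transformers package):
from sentence_transformers import util

# query_vector: embedding of the text query
# corpus_vector: one embedding per chunk
# docs: chunk identifiers (for example, start times), in the same order as corpus_vector

# Calculate cosine similarity between the corpus of vectors and the query vector
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)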
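Since each chunk covers one second, a chunk's index is also its start time in seconds. The sketch below ranks chunk embeddings against a query embedding with plain-Python cosine similarity and returns the best-matching start times. `top_seconds` and the toy vectors are illustrative assumptions; in practice the query vector would come from the same CLIP model's text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_seconds(query_vector, chunk_vectors, k=3):
    """Return the k best (start_second, score) pairs, highest score first."""
    scored = [(cosine(query_vector, vec), second)
              for second, vec in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(second, score) for score, second in scored[:k]]


# Toy 2-dimensional embeddings for three 1-second chunks
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_seconds([1.0, 0.0], toy_chunks, k=2)
```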
The challenge with this approach, however, is that treating each 1-second interval as a series of independent frames does not capture the context of the video. The intervals should be treated as moving images.
Mixpeek offers a managed search API that does this:
GET: https://api.mixpeek.com/v1/search?q=people+experiencing+joy
Response:
[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {
      "file_ext": "mp4",
      "file_id": "ebc289d7-44e1-4672-bf3c-ccfa490b7k2d",
      "file_url": "https://mixpeek.s3.amazonaws.com/<user>/<file>.mp4",
      "filename": "CR-9146f0.mp4"
    },
    "score": 0.636489987373352,
    "timestamps": [
      2.5035398230088495,
      1.2517699115044247,
      3.755309734513274
    ]
  }
]
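A client would typically read the hits and jump to the returned timestamps. The sketch below parses a response of the shape shown above using only the standard library. It makes no HTTP call, since authentication details are not shown here, and `summarize_hits` is a hypothetical helper. It also assumes the timestamps are offsets in seconds, which the response itself does not state.

```python
import json

# Abbreviated copy of the response above (assumption: timestamps are in seconds)
sample_response = """
[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {"file_ext": "mp4", "filename": "CR-9146f0.mp4"},
    "score": 0.636489987373352,
    "timestamps": [2.5035398230088495, 1.2517699115044247, 3.755309734513274]
  }
]
"""


def summarize_hits(response_text):
    """Return (filename, timestamps in ascending order) per hit, best score first."""
    hits = json.loads(response_text)
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return [(hit["metadata"]["filename"], sorted(hit["timestamps"])) for hit in hits]


summary = summarize_hits(sample_response)
for filename, timestamps in summary:
    print(filename, [round(t, 2) for t in timestamps])
```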
Further reading and demo: https://learn.mixpeek.com/what-is-semantic-video-search/