Interesting question. Here's my take on it.
In essence, the subtitles "don't know" about each other, so it'd be best to include the previous and subsequent subtitle text in each doc (n - 1, n, n + 1) whenever applicable.
As such, you'd be gunning for a doc structure similar to:
{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}
To arrive at such a doc structure I used the following (inspired by this excellent answer):
from itertools import groupby
from collections import namedtuple

def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')
    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')
            # ints only
            sub_id = int(sub_id)
            # join multi-line text
            text = ', '.join(content)
            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''
        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '
        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text
        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs
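The groupby chunking is probably the least obvious step, so here is a small standalone illustration of just that part. It's only a sketch: the in-memory list of lines stands in for the opened file, and it assumes the usual SRT layout (id, time range, text, blank line).

```python
from itertools import groupby

# Two subtitle blocks in SRT-like form, separated by a blank line.
# This list stands in for the lines you'd get by iterating over the file.
lines = [
    "0\n",
    "00:02:17,440 --> 00:02:20,375\n",
    "Senator, we're making our final\n",
    "\n",
    "1\n",
    "00:02:20,476 --> 00:02:22,501\n",
    "approach into Coruscant.\n",
]

# groupby splits the lines into consecutive runs of non-blank (True) and
# blank (False) lines; keeping only the True runs gives one list per subtitle.
chunks = [
    [line.strip() for line in group]
    for is_content, group in groupby(lines, key=lambda line: bool(line.strip()))
    if is_content
]

for chunk in chunks:
    print(chunk)
```

Each chunk then unpacks into sub_id, the "start --> end" line, and one or more text lines, which is what the loop in parse_subs relies on.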
Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:
PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
Once that's done, proceed to ingest:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)
Once ingested, you can target the original subtitle text and boost it if it directly matches your phrase. Otherwise, add a fallback on the overlapping_text field, which will ensure that both "overlapping" subtitles are returned.
Before returning, you can make sure that the hits are ordered by start, ascending. That kind of defeats the purpose of boosting, but if you do sort, you can specify track_scores=true in the URI to make sure the originally calculated scores are returned too.
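If you'd rather assemble the request from Python, the same logic could be wrapped in a small helper that builds the search body. This is only a sketch: build_phrase_query is a name I made up, and it sets track_scores in the body rather than in the URI, which to my knowledge the search API accepts as well.

```python
def build_phrase_query(phrase, boost=2):
    """Build a search body that favours direct matches on `text`
    and falls back to `overlapping_text` (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "should": [
                    # direct hit within a single subtitle, boosted
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # fallback: the phrase spans neighbouring subtitles
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        # chronological order; track_scores keeps _score populated when sorting
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
        "track_scores": True,
    }


body = build_phrase_query("final approach")
```

The resulting dict can then be handed to the client's search call against my_subtitles_index; how exactly the body is passed depends on the elasticsearch-py version you're on.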
Putting it all together:
POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
yields:
{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
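In case the sort values look cryptic: as far as I can tell they're the start.as_timestamp values in epoch milliseconds, and since the format carries no date part, that works out to milliseconds since 00:00:00. A quick way to sanity-check them in plain Python (srt_to_millis is just an illustrative helper, not something ES provides):

```python
def srt_to_millis(timestamp):
    """Convert an 'HH:mm:ss,SSS' timestamp to milliseconds since 00:00:00."""
    hms, millis = timestamp.split(',')
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


print(srt_to_millis("00:02:17,440"))  # 137440
print(srt_to_millis("00:02:20,476"))  # 140476
```

Both numbers line up with the sort arrays in the response.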