1
00:02:17,440 --> 00:02:20,375
Hello Bob,

2
00:02:20,476 --> 00:02:22,501
how are you doing today?
...

Consider a standard .srt file, which contains subtitle text along with timestamp information for displaying each line in sync with the audio on the client side.

I need to index this text data into Elasticsearch while retaining the timestamp information. I am currently using a custom formatter that includes the timestamps within the sentence. For example:

(137) Hello Bob, how are you doing today? (142)

This indicates that the sentence starts at second 137 and ends at second 142.

However, I'm not sure if this approach is the best way to handle the timestamps. Any help would be appreciated.
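For context, the inline format described above can be produced with a short script. This is a minimal sketch (the function name and the per-cue granularity are my assumptions; the question's formatter apparently merges cues into whole sentences, which this sketch does not attempt):

```python
import re

def srt_to_inline(srt_text):
    """Convert SRT cues into '(start) text (end)' inline form,
    with start/end truncated to whole seconds."""
    out = []
    ts = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed cues
        m = re.match(ts + r" --> " + ts, lines[1])
        if not m:
            continue
        h1, m1, s1, _, h2, m2, s2, _ = (int(g) for g in m.groups())
        start = h1 * 3600 + m1 * 60 + s1
        end = h2 * 3600 + m2 * 60 + s2
        text = " ".join(lines[2:])
        out.append(f"({start}) {text} ({end})")
    return " ".join(out)
```

Run against the example cues at the top of the question, the first cue comes out as "(137) Hello Bob, (140)".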

Andy
  • Anything useful to you in this thread? https://stackoverflow.com/questions/28431583/searching-subtitle-data-in-elasticsearch – Val Mar 24 '23 at 08:34
  • I had looked at it before but it doesn't quite work for me. They are indexing the sentences which is not what I want. I need to index the entire article since I will be using elasticsearch highlighting feature. – Andy Mar 24 '23 at 08:59
  • What do you mean by the entire article ? – Paulo Mar 28 '23 at 09:45
  • Each article is around 750 words. I don't want to store each sentence in one document (as suggested in previous questions). I want one document for each article. – Andy Mar 28 '23 at 12:19
  • One article is a whole .srt file? – Paulo Mar 28 '23 at 13:46
  • Yes. I want each .srt file to be one document in elasticsearch. – Andy Mar 28 '23 at 14:58

2 Answers


You can create a field for the start and end timestamps, and then use range queries to retrieve the relevant text data. This approach allows for more complex queries and filtering when you want to operate on timestamp-related information.

You could also consider mapping the timestamps with the Elasticsearch "date" field type, which lets you run date-based queries and aggregations on the data.
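For illustration, a range query over separate start/end fields might look like the following. This is a sketch only; the field names "start_s"/"end_s" and the 130–145 s window are my assumptions, not part of the answer:

```python
import json

# Hypothetical query: find cues whose interval overlaps the
# window 130-145 s, assuming numeric per-cue fields
# "start_s" and "end_s" on each document.
query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"start_s": {"lte": 145}}},
                {"range": {"end_s": {"gte": 130}}},
            ]
        }
    }
}
print(json.dumps(query, indent=2))
```

The overlap test (start <= window_end AND end >= window_start) returns any cue that intersects the window, not only cues fully contained in it.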

input {
  file {
    path => "/path/to/your/file.srt"
    # Group each SRT cue (index line, timestamp line, text lines) into
    # a single event; lines that are not a bare cue index are appended
    # to the previous line. A plain codec would emit one event per line
    # and the grok pattern below would never match.
    codec => multiline {
      pattern => "^\d+$"
      negate => true
      what => "previous"
      charset => "UTF-8"
    }
  }
}

filter {
  grok {
    # (?m) lets the text capture span multiple subtitle lines
    match => { "message" => "(?<start_timestamp>\d{2}:\d{2}:\d{2},\d{3}) --> (?<end_timestamp>\d{2}:\d{2}:\d{2},\d{3})\s+(?m)(?<text>.*)" }
  }
  date {
    match => ["start_timestamp", "HH:mm:ss,SSS"]
    target => "@timestamp"
  }
  # mutate's convert only supports types such as integer, float, string
  # and boolean ("date_time" is not valid), so parse the end timestamp
  # with a second date filter instead.
  date {
    match => ["end_timestamp", "HH:mm:ss,SSS"]
    target => "end_time"
  }
  mutate {
    remove_field => ["message"]
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "your_index_name"
    ...
  }
}
Ankit

Another way to go is with Filebeat:

filebeat.inputs:
- type: filestream
  id: srt
  paths:
    - /usr/share/filebeat/*.srt
  parsers:
    - multiline:
        type: pattern
        pattern: '^\d+$'
        negate: true
        match: after

processors:
  - dissect:
      tokenizer: "%{index}\n%{start} --> %{stop}\n%{text}"
      field: "message"
      target_prefix: "dissect"
      trim_chars: "\n"
      trim_values: "right"
  - replace:
      fields:
        - field: "dissect.start"
          pattern: "^(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})"
          replacement: "$1 h$2 m$3 s$4 ms"
        - field: "dissect.start"
          pattern: " "
          replacement: ""
        - field: "dissect.stop"
          pattern: "^(\\d{2}):(\\d{2}):(\\d{2}),(\\d{3})"
          replacement: "$1 h$2 m$3 s$4 ms"
        - field: "dissect.stop"
          pattern: " "
          replacement: ""
  - decode_duration:
      field: "dissect.start"
      format: "seconds"
  - decode_duration:
      field: "dissect.stop"
      format: "seconds"
      
output.console:
  pretty: true
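The replace + decode_duration steps above amount to rewriting "00:02:17,440" as the Go-style duration "00h02m17s440ms" and decoding it to seconds. The same transformation, sketched in Python for clarity (the helper name is my own):

```python
import re

def hms_to_seconds(ts):
    """Mimic the replace + decode_duration processors:
    '00:02:17,440' -> 00h 02m 17s 440ms -> 137.44 seconds."""
    m = re.match(r"^(\d{2}):(\d{2}):(\d{2}),(\d{3})$", ts)
    if not m:
        raise ValueError(f"not an SRT timestamp: {ts!r}")
    h, mi, s, ms = (int(g) for g in m.groups())
    return h * 3600 + mi * 60 + s + ms / 1000
```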
Paulo