How to map word level timestamps to text of a given transcript?

Question

I am currently developing a tool to visualize song lyrics. The tool computes the phonetic similarity of syllables and assigns a rhyme group to each syllable. Syllables belonging to the same group are highlighted in the same color. To create an interactive visualization synchronized with the currently playing song, I need timestamps at the word level or, better, at the syllable level.

I have already developed an algorithm that divides words into syllables and assigns them to groups. The results are stored in a text.json file.

To generate the timestamps, I am using the excellent library whisper-timestamped. This allows me to obtain timestamps.json, which contains an AI-generated transcript of the song lyrics along with the corresponding timestamps.

Now, my objective is to incorporate the timestamps into my text.json. However, I encounter challenges because the AI does not accurately detect every word. Therefore, I need to devise a clever approach to map the words from timestamps.json to the original words in text.json.

For reference, here is an example of a text.json and a timestamps.json of the original transcript "Hey What do you say here?":

text.json

{
    "text": "Hey What do you say here",
    "verses": [
        {
            "text": "Hey What do you say here",
            "verse_id": 0,
            "bars": [
                {
                    "bar": "Hey What do you say here",
                    "bar_id": 0,
                    "rhyme_group": null,
                    "words": [
                        {
                            "word": "Hey",
                            "word_id": 0,
                            "syllables": [
                                {
                                    "syllable": "Hey",
                                    "syllable_id": 0,
                                    "pronunciation": [
                                        "HH EY1"
                                    ],
                                    "rhyme_group": 1,
                                    "timestamp": null
                                }
                            ]
                        },
                        {
                            "word": "What",
                            "word_id": 1,
                            "syllables": [
                                {
                                    "syllable": "What",
                                    "syllable_id": 1,
                                    "pronunciation": [
                                        "W AH1 T"
                                    ],
                                    "rhyme_group": 0,
                                    "timestamp": null
                                }
                            ]
                        },
                        {
                            "word": "do",
                            "word_id": 2,
                            "syllables": [
                                {
                                    "syllable": "do",
                                    "syllable_id": 2,
                                    "pronunciation": [
                                        "D UW1"
                                    ],
                                    "rhyme_group": 2,
                                    "timestamp": null
                                }
                            ]
                        },
                        {
                            "word": "you",
                            "word_id": 3,
                            "syllables": [
                                {
                                    "syllable": "you",
                                    "syllable_id": 3,
                                    "pronunciation": [
                                        "Y UW1"
                                    ],
                                    "rhyme_group": 2,
                                    "timestamp": null
                                }
                            ]
                        },
                        {
                            "word": "say",
                            "word_id": 4,
                            "syllables": [
                                {
                                    "syllable": "say",
                                    "syllable_id": 4,
                                    "pronunciation": [
                                        "S EY1"
                                    ],
                                    "rhyme_group": 1,
                                    "timestamp": null
                                }
                            ]
                        },
                        {
                            "word": "here",
                            "word_id": 5,
                            "syllables": [
                                {
                                    "syllable": "here",
                                    "syllable_id": 5,
                                    "pronunciation": [
                                        "HH IY1 R"
                                    ],
                                    "rhyme_group": 0,
                                    "timestamp": null
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

timestamps.json

{
  "text": "Hey What do you say here",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Hey!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {
          "text": "Hey!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " What do you say here?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {
          "text": "What",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
        },
        {
          "text": "do",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
        },
        {
          "text": "you",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
        },
        {
          "text": "ray",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
        },
        {
          "text": "here?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
        }
      ]
    }
  ],
  "language": "en"
}
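The two files nest their words differently (verses, bars, words in text.json; segments, words in timestamps.json), so a first step could be to flatten both into plain word lists that can be compared. The following is only a minimal sketch under that assumption; the helper names `normalize`, `flatten_lyrics` and `flatten_transcript` are hypothetical and not part of whisper-timestamped.

```python
import string


def normalize(word: str) -> str:
    """Lowercase a word and strip surrounding punctuation and whitespace,
    so that "Hey!" and "Hey" compare as equal."""
    return word.strip(string.punctuation + string.whitespace).lower()


def flatten_lyrics(text_json: dict) -> list[dict]:
    """Return every word object of text.json in the order it is sung."""
    return [
        word
        for verse in text_json["verses"]
        for bar in verse["bars"]
        for word in bar["words"]
    ]


def flatten_transcript(timestamps_json: dict) -> list[dict]:
    """Return every word object of timestamps.json in time order."""
    return [
        word
        for segment in timestamps_json["segments"]
        for word in segment["words"]
    ]
```

Because the functions return the original word objects rather than copies, a timestamp written to one of the flattened entries would also show up in the nested structure.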

There are four different scenarios to consider:

1. Perfect Match: In this scenario, by some miracle, the words in the original text and the AI-transcribed words match exactly. Timestamping the original text becomes trivial in this case.
2. Incorrect Words Detected: Here, whenever the singer sings a word, the AI also detects a word, although it may not be the correct word. For example, "I like you" might be transcribed as "I fight you." Despite the incorrect transcription, timestamping is relatively straightforward because the timestamps for the detected words will still align. The accuracy of the transcription is not crucial as long as the timestamps remain correct.
3. Multiple Words Detected as One: In this situation, the AI detects multiple words of the original text as a single word. For instance, "I like you" might be transcribed as "despite you." In such cases, I have reliable timestamps only for the beginning of "I" and the end of "like" (and for "you"), since "I" and "like" are detected as a single word; the boundary between them is missing. Timestamping becomes more challenging here.
4. Single Word Detected as Multiple Words: Conversely, a single word from the original text might be detected as multiple words by the AI. For example, "despite you" might be transcribed as "I like you." In this case, I have all the timestamps I need, but it is important to recognize that "despite" should have the beginning timestamp of "I" and the ending timestamp of "like."
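To make the scenarios concrete: once a block of original words has been paired with a block of detected words, the timing could be assigned roughly as sketched below. The function name `spread_timestamps` is hypothetical, and splitting a merged span in proportion to word length is purely an assumption; syllable counts from text.json might be a better weight.

```python
def spread_timestamps(original_words: list[str], detected: list[dict]) -> list[tuple[float, float]]:
    """Give each original word a (start, end) pair from one paired block.

    `detected` holds whisper-timestamped word objects with "start" and "end".
    """
    if not original_words or not detected:
        raise ValueError("both sides of the block must be non-empty")

    if len(original_words) == len(detected):
        # Scenarios 1 and 2: assume one detected word per original word,
        # regardless of whether the text matches.
        return [(d["start"], d["end"]) for d in detected]

    # Scenarios 3 and 4: only the outer boundaries of the block are trusted.
    block_start = detected[0]["start"]
    block_end = detected[-1]["end"]

    # Assumption: longer words take proportionally more time.
    weights = [max(len(word), 1) for word in original_words]
    total = sum(weights)

    spans = []
    cursor = block_start
    used = 0
    for index, weight in enumerate(weights):
        used += weight
        if index == len(weights) - 1:
            end = block_end  # avoid floating-point drift on the last word
        else:
            end = block_start + (block_end - block_start) * used / total
        spans.append((cursor, end))
        cursor = end
    return spans
```

For scenario 4 this yields a single span from the start of "I" to the end of "like" for "despite"; for scenario 3 the inner boundary between "I" and "like" is only an estimate.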

While these examples may seem basic, finding a suitable solution becomes more challenging with longer texts.

One approach I have considered is comparing word/sentence similarity using techniques like the Levenshtein distance and other fuzzy string matching methods. However, I still have not found a reliable method to resolve the conflicts that arise in scenarios 2-4.
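As an illustration of the sequence-comparison idea, here is a minimal sketch that uses `difflib.SequenceMatcher` from the Python standard library on two lists of already normalized words. Note that it matches whole words exactly rather than by Levenshtein distance; the wrapper name `align_blocks` is hypothetical. Exactly matching runs act as anchors, and each "replace" block between two anchors corresponds to one of scenarios 2-4, depending on how many words it has on each side.

```python
from difflib import SequenceMatcher


def align_blocks(original: list[str], detected: list[str]) -> list[tuple[str, range, range]]:
    """Pair index ranges of the original words with ranges of the detected words.

    Tags follow difflib's opcodes:
      "equal"   - the words match exactly (scenario 1)
      "replace" - a block of original words maps to a block of detected words
                  (scenarios 2-4, depending on the two block lengths)
      "delete"  - original words with no detected counterpart
      "insert"  - detected words with no original counterpart
    """
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise ignore frequently repeated words in long lyrics.
    matcher = SequenceMatcher(a=original, b=detected, autojunk=False)
    return [
        (tag, range(i1, i2), range(j1, j2))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


if __name__ == "__main__":
    original = ["hey", "what", "do", "you", "say", "here"]
    detected = ["hey", "what", "do", "you", "ray", "here"]
    for tag, orig_range, det_range in align_blocks(original, detected):
        print(tag, [original[i] for i in orig_range], [detected[j] for j in det_range])
```

A fuzzy similarity score (for example Levenshtein distance) could then be applied only inside the "replace" blocks, which are usually short, instead of across the whole song.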

Do you have any ideas on how I could reliably map the AI-generated timestamps to my original text?

Have you tried force aligners? This is how we timestamp words to audio in speech recognition. — Kathy Reid, Jun 29 '23 at 07:10
Thank you for your comment :) I have looked a bit into forced alignment. From my understanding, that would be beneficial when aligning an original text with audio. In my case, I want to align my generated transcript file to the original text file. Otherwise I cannot align the timestamps. Do you know of any forced alignment approaches that create word-level timestamps? — paulpelikan, Jun 29 '23 at 12:16
haven't used it, but this might be what you're looking for: https://github.com/openai/whisper/discussions/684 — Kathy Reid, Jun 29 '23 at 22:31
