How can I map each word in the `display` field to its timestamp in Azure speech-to-text output?

Question

Azure speech-to-text output has a `display` field in `combinedRecognizedPhrases`. How can I map each word in the `display` field to its timestamp?

The Azure speech-to-text output contains word-level timestamps but only for the lexical field in combinedRecognizedPhrases.

Example of Azure speech-to-text output (partial):

        {
            "recognitionStatus": "Success",
            "channel": 0,
            "offset": "PT1M41.29S",
            "duration": "PT31.27S",
            "offsetInTicks": 1012900000,
            "durationInTicks": 312700000,
            "nBest": [
                {
                    "confidence": 0.9715241,
                    "lexical": "youtube dot com slash derek mitchell and then also live streaming behalf dot net slash mitchell 's garage so you can check me out there and then did i say histogram always say that you're gonna meet or not instagram dot com slash D mitchell design so hopefully there's links and buttons and stuff here right there anyway alright guys we're about to dive into some really cool stuff feel free to comment in the thread and i'd love to again see what you're up to and i'll try and answer your questions as we get going but let's go ahead and dive in",
                    "itn": "youtube.com/derek mitchell and then also livestreamingbehalf.net/mitchell's garage so you can check me out there and then did i say histogram always say that you're gonna meet or not instagram.com/D mitchell design so hopefully there's links and buttons and stuff here right there anyway alright guys we're about to dive into some really cool stuff feel free to comment in the thread and i'd love to again see what you're up to and i'll try and answer your questions as we get going but let's go ahead and dive in",
                    "maskedITN": "",
                    "display": "Youtube.com/derek Mitchell and then also livestreamingbehalf.net/mitchell's garage so you can check me out there and then did I say histogram. Always say that you're gonna meet or not instagram.com/D Mitchell design, so hopefully there's links and buttons and stuff here right there anyway? Alright guys, we're about to dive into some really cool stuff. Feel free to comment in the thread and I'd love to again see what you're up to and I'll try and answer your questions as we get going. But let's go ahead and dive in.",
                    "words": [
                        {
                            "word": "youtube",
                            "offset": "PT1M41.29S",
                            "duration": "PT0.41S",
                            "offsetInTicks": 1012900000,
                            "durationInTicks": 4100000,
                            "confidence": 0.9879842
                        },
                        {
                            "word": "dot",
                            "offset": "PT1M41.71S",
                            "duration": "PT0.15S",
                            "offsetInTicks": 1017100000,
                            "durationInTicks": 1500000,
                            "confidence": 0.971495
                        },
                        {
                            "word": "com",
                            "offset": "PT1M41.87S",
                            "duration": "PT0.51S",
                            "offsetInTicks": 1018700000,
                            "durationInTicks": 5100000,
                            "confidence": 0.92946804
                        },
                        {
                            "word": "slash",
                            "offset": "PT1M42.41S",
                            "duration": "PT0.73S",
                            "offsetInTicks": 1024100000,
                            "durationInTicks": 7300000,
                            "confidence": 0.930045
                        },
                        {
                            "word": "derek",
                            "offset": "PT1M43.17S",
                            "duration": "PT0.45S",
                            "offsetInTicks": 1031700000,
                            "durationInTicks": 4500000,
                            "confidence": 0.9679087
                        },
                        {
                            "word": "mitchell",
                            "offset": "PT1M43.63S",
                            "duration": "PT0.38S",
                            "offsetInTicks": 1036300000,
                            "durationInTicks": 3800000,
                            "confidence": 0.9761796
                        },
                        {
                            "word": "and",
                            "offset": "PT1M44.11S",
                            "duration": "PT0.43S",
                            "offsetInTicks": 1041100000,
                            "durationInTicks": 4300000,
                            "confidence": 0.9912365
                        },
                        {
                            "word": "then",
                            "offset": "PT1M44.55S",
                            "duration": "PT0.13S",
                            "offsetInTicks": 1045500000,
                            "durationInTicks": 1300000,
                            "confidence": 0.99012697
                        },
                        {
                            "word": "also",
                            "offset": "PT1M44.69S",
                            "duration": "PT0.29S",
                            "offsetInTicks": 1046900000,
                            "durationInTicks": 2900000,
                            "confidence": 0.98977005
                        },
                        {
                            "word": "live",
                            "offset": "PT1M44.99S",
                            "duration": "PT0.25S",
                            "offsetInTicks": 1049900000,
                            "durationInTicks": 2500000,
                            "confidence": 0.98370486
                        },
                        {
                            "word": "streaming",
                            "offset": "PT1M45.25S",
                            "duration": "PT0.55S",
                            "offsetInTicks": 1052500000,
                            "durationInTicks": 5500000,
                            "confidence": 0.9920498
                        },
                        {
                            "word": "behalf",
                            "offset": "PT1M45.83S",
                            "duration": "PT0.53S",
                            "offsetInTicks": 1058300000,
                            "durationInTicks": 5300000,
                            "confidence": 0.8917482
                        },
                        {
                            "word": "dot",
                            "offset": "PT1M46.37S",
                            "duration": "PT0.19S",
                            "offsetInTicks": 1063700000,
                            "durationInTicks": 1900000,
                            "confidence": 0.9815966
                        },
                        {
                            "word": "net",
                            "offset": "PT1M46.57S",
                            "duration": "PT0.28S",
                            "offsetInTicks": 1065700000,
                            "durationInTicks": 2800000,
                            "confidence": 0.9887448
                        },
                        {
                            "word": "slash",
                            "offset": "PT1M46.88S",
                            "duration": "PT0.7S",
                            "offsetInTicks": 1068800000,
                            "durationInTicks": 7000000,
                            "confidence": 0.98829234
                        },
                        {
                            "word": "mitchell",
                            "offset": "PT1M47.85S",
                            "duration": "PT0.41S",
                            "offsetInTicks": 1078500000,
                            "durationInTicks": 4100000,
                            "confidence": 0.98511887
                        },
                        {
                            "word": "'s",
                            "offset": "PT1M48.27S",
                            "duration": "PT0.05S",
                            "offsetInTicks": 1082700000,
                            "durationInTicks": 500000,
                            "confidence": 0.95022047
                        },
                        {
                            "word": "garage",
                            "offset": "PT1M48.33S",
                            "duration": "PT0.55S",
                            "offsetInTicks": 1083300000,
                            "durationInTicks": 5500000,
                            "confidence": 0.9919236
                        },
                        {
                            "word": "so",
                            "offset": "PT1M48.91S",
                            "duration": "PT0.13S",
                            "offsetInTicks": 1089100000,
                            "durationInTicks": 1300000,
                            "confidence": 0.9841132
                        },

The `words` list has word-level timestamps, but its entries follow the words of the `lexical` field, not those of the `display` field.

Are these helpful? [How to get Word Level Timestamps using Azure Speech to Text and the Python SDK?](https://stackoverflow.com/questions/56842391/how-to-get-word-level-timestamps-using-azure-speech-to-text-and-the-python-sdk), [Word/phrase level timestamp consistency between recognizing and recognized result](https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/665) and [Word/phrase level timestamp support possible?](https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/issues/2) — Ecstasy, Jul 29 '22 at 04:13
@DeepDave-MT thanks, they don't seem to look at the timestamp for the `display` field, but only the `lexical` field. — Franck Dernoncourt, Jul 30 '22 at 23:06

score 0 · Accepted Answer · answered Aug 22 '22 at 20:14

Two solutions:

Solution 1: From chlandsi on GitHub:

> With the 3.1 version of the API (currently in preview) you can request word-level timestamps on the display form with the `displayFormWordLevelTimestampsEnabled` property.

Solution 2: Use "Find the most likely word alignment between two strings in Python".
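
For the alignment approach, here is a minimal sketch. It is not the method from the linked question: it swaps in Python's standard `difflib.SequenceMatcher` to align the whitespace-separated `display` tokens with the entries of `words`. The helper names `normalize` and `align_display_to_words` are made up for illustration, and the sketch assumes ticks are 100-nanosecond units, which matches the sample above (`offsetInTicks` 1012900000 for `PT1M41.29S`). A display token that merges several lexical words, such as `Youtube.com/derek`, gets the time span of the whole lexical run it replaces.

```python
import re
from difflib import SequenceMatcher

TICKS_PER_SECOND = 10_000_000  # assumption: one tick is 100 nanoseconds


def normalize(token):
    """Lowercase a token and keep only letters, digits and apostrophes."""
    return re.sub(r"[^a-z0-9']", "", token.lower())


def align_display_to_words(display, words):
    """Return a list of (display_token, start_seconds, end_seconds).

    Tokens in a non-matching block share the span of the lexical words
    they replace; tokens with no lexical counterpart get (None, None).
    """
    tokens = display.split()
    a = [normalize(t) for t in tokens]
    b = [normalize(w["word"]) for w in words]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # One-to-one match: copy each lexical word's own timing.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                start = words[j]["offsetInTicks"]
                end = start + words[j]["durationInTicks"]
                result.append((tokens[i], start / TICKS_PER_SECOND, end / TICKS_PER_SECOND))
        elif tag == "replace":
            # Coarse fallback: every token in the block gets the block's span.
            start = words[j1]["offsetInTicks"]
            last = words[j2 - 1]
            end = last["offsetInTicks"] + last["durationInTicks"]
            for i in range(i1, i2):
                result.append((tokens[i], start / TICKS_PER_SECOND, end / TICKS_PER_SECOND))
        elif tag == "delete":
            # Display tokens with nothing to align against.
            for i in range(i1, i2):
                result.append((tokens[i], None, None))
        # "insert" means lexical words without a display token: nothing to emit.
    return result


# First entries of the sample output above (word, offsetInTicks, durationInTicks).
sample = [("youtube", 1012900000, 4100000), ("dot", 1017100000, 1500000),
          ("com", 1018700000, 5100000), ("slash", 1024100000, 7300000),
          ("derek", 1031700000, 4500000), ("mitchell", 1036300000, 3800000),
          ("and", 1041100000, 4300000), ("then", 1045500000, 1300000),
          ("also", 1046900000, 2900000)]
words = [{"word": w, "offsetInTicks": o, "durationInTicks": d} for w, o, d in sample]
aligned = align_display_to_words("Youtube.com/derek Mitchell and then also", words)
for token, start, end in aligned:
    print(token, start, end)
```

The coarse handling of non-matching blocks is a design choice of this sketch: it never invents timestamps, but several display tokens inside one block end up sharing the same span. If finer resolution is needed there, the span could be split among the tokens, for example in proportion to their length.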
