What's going on here:
I created this demo as research into using AI for accessibility. I wrote some local processing software built on OpenAI's Whisper speech-recognition library: it takes the audio stream from the video and produces a transcript. Timestamped lines are standard in any common captioning format, but the AI-powered transcription can also include precise timing for each word within a line. I export this data as a basic JSON file.
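The export step can be sketched roughly like this. The openai-whisper library accepts a `word_timestamps=True` option on `transcribe`, which attaches per-word `start`/`end` times to each segment; the JSON schema below is illustrative, not necessarily the exact format the demo uses.

```python
import json

def to_caption_json(result):
    """Flatten a Whisper-style result into line + word timing records.

    Assumes the shape returned by openai-whisper's
    model.transcribe(path, word_timestamps=True): a dict of segments,
    each with "text", "start", "end", and a "words" list.
    """
    lines = []
    for seg in result["segments"]:
        lines.append({
            "text": seg["text"].strip(),
            "start": seg["start"],
            "end": seg["end"],
            "words": [
                {"word": w["word"].strip(), "start": w["start"], "end": w["end"]}
                for w in seg.get("words", [])
            ],
        })
    return lines

# Example with a hand-made result in Whisper's shape:
sample = {"segments": [{
    "text": " Hello world", "start": 0.0, "end": 1.2,
    "words": [{"word": " Hello", "start": 0.0, "end": 0.5},
              {"word": " world", "start": 0.6, "end": 1.2}],
}]}

print(json.dumps(to_caption_json(sample), indent=2))
```

In practice the input would be the real `transcribe` result for the video's audio track, and the output would be written to a `.json` file alongside the video.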
With the subtitle timing data, I can display subtitles in the standard per-line format, and also use the per-word timing to light up each word as it's spoken, as you see above. This karaoke-style highlighting can help increase comprehension for many people.
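The highlighting logic reduces to one lookup: given the player's current playback time, find the word whose interval contains it. A minimal sketch, assuming the word records carry `start`/`end` times as in the exported JSON (the function name is my own):

```python
import bisect

def active_word_index(words, t):
    """Return the index of the word being spoken at time t, or None.

    words: list of {"word", "start", "end"} dicts sorted by start time.
    """
    starts = [w["start"] for w in words]
    i = bisect.bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < words[i]["end"]:
        return i
    return None  # t falls in a gap between words, or before the first word

words = [{"word": "Hello", "start": 0.0, "end": 0.5},
         {"word": "world", "start": 0.6, "end": 1.2}]
print(active_word_index(words, 0.3))   # → 0
print(active_word_index(words, 0.55))  # gap between words → None
```

The display layer would call this on each playback-time update and style the matching word; the binary search keeps the lookup cheap even for long lines.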
In addition to this basic functionality, I've also used the subtitle data to power a search function, letting someone jump to various points in the video by looking for a specific word or phrase. Naturally, this is most helpful if you already know some of the content in the video.
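Because every line in the transcript carries a timestamp, search is just a text match that returns seek targets. A sketch under the same assumed JSON schema (simple case-insensitive substring matching per line; a real implementation might also match across line boundaries):

```python
def find_phrase(lines, phrase):
    """Return the start times of transcript lines containing phrase.

    lines: list of {"text", "start", ...} records as in the exported JSON.
    """
    needle = phrase.lower()
    return [ln["start"] for ln in lines if needle in ln["text"].lower()]

lines = [
    {"text": "Hello world", "start": 0.0},
    {"text": "Searching the transcript", "start": 4.2},
    {"text": "hello again", "start": 9.8},
]
print(find_phrase(lines, "hello"))  # → [0.0, 9.8]
```

Each returned timestamp becomes a candidate seek position for the video player, so a search result doubles as a navigation link.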