Live + Offline Automatic Speech Recognition

To transcribe meetings live and provide search for the meeting transcriptions. With efficient processing

Software Development
Data Engg.
Machine Learning

The aero lesson builder app dragging an audio component into a screen about plant cells.

The Problem

I faced the task of enabling search within meeting transcripts, spanning discussions among 2 to 4 participants, whether conducted online or offline. The aim was to seamlessly reference past discussions during live conversations within the same group.

A set of light themed components for the aero design system

Solution

The objective was to develop a foundational solution that could later evolve to address this challenge comprehensively. It served as a checkpoint to identify all essential considerations for tackling the aforementioned problem, including:

Real-time transcription of live conversations, regardless of their online or offline nature.
Retrieval of past meeting conversations within the same group.
Ensuring the accuracy and efficiency of reference searches.
Exploring various scenarios to be accounted for in the solution design process.

The homepage of the aero design system docs website linking to principles and components.

Tech Stack

Vector DB - Qdrant - A Rust made mature vector only database which is opensource
Transcription Model - Whisper, opensource and one of the top transcription models
Text to Text Transformer Model - Google T5 Model
User Interface - Gradio - Apart from it being best known for AI software interfaces, its also best when using its api client

Functional Features

Vector Search provided does both semantic and keyword based context retrieval
Provides Real-time transcription of live conversations, also provides offline transcription
Accuracy and efficiency of reference Searches

A drag and drop storyboard style editor for creating an adaptive lesson.

Non-Functional Features

Semantic and summation of long context
Project is Enterprise grade from ci/cd automation, infra as code, scalablity and very coosen stack
Hybrid vector search - Dense and Sparse
Realtime and efficient on resources
Real time live recording works at 6GB GPU memory and rest models work on CPU inference

Project Outcome

Achieved the actual poc to be done on live meeting transcription and learnt details accross different models including diarization (speaker recognition) and other problems which occur during live stream transcription. Also issues like very long context which is not supported by current models with summarization as well as context switching