Live + Offline Automatic Speech Recognition

Transcribing meetings live and making the resulting transcripts searchable, with efficient processing.

  • Software Development
  • Data Engineering
  • Machine Learning

The Problem

I faced the task of enabling search within meeting transcripts, spanning discussions among 2 to 4 participants, whether conducted online or offline. The aim was to seamlessly reference past discussions during live conversations within the same group.

Solution

The objective was to develop a foundational solution that could later evolve to address this challenge comprehensively. It served as a checkpoint to identify all essential considerations for tackling the aforementioned problem, including:

  • Real-time transcription of live conversations, regardless of their online or offline nature.
  • Retrieval of past meeting conversations within the same group.
  • Ensuring the accuracy and efficiency of reference searches.
  • Exploring various scenarios to be accounted for in the solution design process.
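The second point, retrieving past conversations only from the same group, can be illustrated with a minimal sketch. It assumes each stored transcript segment carries a group identifier and an embedding vector; the field names (`group_id`, `text`, `vector`), the toy three-dimensional vectors, and the function names are hypothetical and are not taken from the project. A vector database would normally apply this restriction as a metadata filter rather than a Python loop.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_group(query_vec, segments, group_id, top_k=3):
    """Return the top_k segments of one group, most similar to the query first."""
    # Restrict candidates to the requesting group before scoring anything.
    candidates = [s for s in segments if s["group_id"] == group_id]
    ranked = sorted(candidates,
                    key=lambda s: cosine(query_vec, s["vector"]),
                    reverse=True)
    return ranked[:top_k]

# Toy data: hypothetical segments with made-up embeddings.
segments = [
    {"group_id": "team-a", "text": "budget review", "vector": [1.0, 0.0, 0.0]},
    {"group_id": "team-a", "text": "hiring plan", "vector": [0.0, 1.0, 0.0]},
    {"group_id": "team-b", "text": "budget forecast", "vector": [1.0, 0.1, 0.0]},
]

hits = search_group([0.9, 0.1, 0.0], segments, group_id="team-a", top_k=2)
hit_texts = [h["text"] for h in hits]
```

Filtering before ranking keeps segments from other groups out of the results even when they are a closer semantic match to the query.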

Tech Stack

  • Vector DB - Qdrant - a mature, open-source, dedicated vector database written in Rust
  • Transcription Model - Whisper - open source and one of the top transcription models
  • Text to Text Transformer Model - Google T5 Model
  • User Interface - Gradio - best known for building AI software interfaces, and also convenient to drive through its API client

Functional Features

  • Vector search performs both semantic and keyword-based context retrieval
  • Provides real-time transcription of live conversations as well as offline transcription
  • Accurate and efficient reference searches
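One way to combine semantic and keyword-based retrieval is to run both searches and merge their ranked results. The sketch below uses reciprocal rank fusion (RRF), a common fusion method; the write-up does not say which fusion method the project used, so treat RRF, the constant `k=60`, and the segment ids as assumptions for illustration only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Each id scores 1 / (k + rank) in every list it appears in; the
    scores are summed, so ids ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result ids from a dense (semantic) and a sparse (keyword) search.
dense_hits = ["m3", "m1", "m7"]
sparse_hits = ["m1", "m9", "m3"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF needs only the rank positions, so the dense and sparse scores do not have to be on a comparable scale.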

Non-Functional Features

  • Semantic search and summarization of long context
  • The project is enterprise grade: CI/CD automation, infrastructure as code, scalability, and a carefully chosen stack
  • Hybrid vector search - Dense and Sparse
  • Real-time and efficient on resources
  • Real-time live recording runs within 6 GB of GPU memory, and the remaining models run on CPU inference
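Live transcription on a fixed memory budget usually means feeding the model bounded, overlapping windows of audio instead of the whole recording. The following is a minimal sketch of that buffering step only, under the assumption that audio arrives as small chunks of samples; the function name, the window and hop sizes, and the integer "samples" are illustrative, and the project's actual streaming logic is not described in the source.

```python
def windowed_stream(chunks, window_size, hop_size):
    """Regroup incoming sample chunks into fixed-size, overlapping windows."""
    if not 0 < hop_size <= window_size:
        raise ValueError("need 0 < hop_size <= window_size")
    buffer = []   # samples received but not yet dropped
    covered = 0   # leading buffer samples already sent in an earlier window
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= window_size:
            yield buffer[:window_size]
            # Keep the last (window_size - hop_size) samples as overlap.
            del buffer[:hop_size]
            covered = window_size - hop_size
    # Flush a shorter final window only if it holds samples not yet emitted.
    if len(buffer) > covered:
        yield buffer[:]

# Integers stand in for audio samples arriving from a microphone.
mic_chunks = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
windows = list(windowed_stream(mic_chunks, window_size=6, hop_size=4))
```

The overlap gives the transcription model some context across window boundaries, at the cost of processing a few samples twice.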

Project Outcome

I completed the planned proof of concept (POC) for live meeting transcription and learned the details of different models, including diarization (speaker recognition) and other problems that occur during live-stream transcription. The work also surfaced open issues, such as very long contexts that current summarization models do not support, as well as context switching.
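A common workaround for the long-context limit is to split the transcript into overlapping chunks, summarize each chunk, and then summarize the combined partial summaries. The sketch below shows only that control flow. It counts whitespace-separated words instead of model tokens, and `first_words` is a hypothetical stand-in for a real summarization model such as T5; neither the chunk sizes nor this approach is confirmed as what the project used.

```python
def chunk_words(text, max_words, overlap):
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the text
    return chunks

def summarize_long(text, summarize, max_words=200, overlap=20, max_rounds=5):
    """Summarize chunk by chunk until the text fits in a single chunk."""
    for _ in range(max_rounds):
        chunks = chunk_words(text, max_words, overlap)
        if len(chunks) <= 1:
            break
        text = " ".join(summarize(chunk) for chunk in chunks)
    # If max_rounds ran out, the text may still exceed max_words here.
    return summarize(text)

def first_words(text, n=5):
    """Placeholder 'summarizer' that keeps the first n words (not a real model)."""
    return " ".join(text.split()[:n])

transcript = " ".join(f"w{i}" for i in range(50))
chunks = chunk_words(transcript, max_words=20, overlap=5)
summary = summarize_long(transcript, first_words, max_words=20, overlap=5)
```

The `max_rounds` cap guards against a summarizer that fails to shorten its input, which would otherwise loop indefinitely.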
