Speech-to-Text
AZUREAL includes a fully local speech-to-text engine that lets you dictate prompts instead of typing them. Audio is captured from your default input device, transcribed on-device via whisper.cpp, and inserted at the cursor position in the prompt or edit buffer. No audio ever leaves your machine.
How It Works
The speech pipeline runs through four stages:
- Capture – Audio is recorded from the default input device via the cpal library (CoreAudio on macOS, WASAPI on Windows). Raw samples arrive as f32 at the device's native sample rate and channel count.
- Preprocessing – Multi-channel audio is mixed down to mono, then resampled to 16 kHz (Whisper's expected input rate).
- Transcription – The accumulated audio buffer is fed to whisper.cpp with Greedy { best_of: 1 } decoding. On macOS, Metal GPU acceleration is used automatically; on Windows, CUDA GPU acceleration is used.
- Insertion – The transcribed text is inserted at the current cursor position with smart spacing: a space is prepended unless the cursor is at the start of a line or immediately after a space.
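The preprocessing and insertion stages can be illustrated with a minimal sketch that uses only the standard library. The function names (`mix_to_mono`, `resample_linear`, `insert_with_smart_spacing`) are hypothetical, and linear interpolation is an assumption made for brevity; this page does not say which resampling method AZUREAL actually uses. The cursor is assumed to be a byte offset on a character boundary.

```rust
/// Average interleaved frames down to a single mono channel.
fn mix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio from `from_hz` to `to_hz` using linear interpolation.
/// (Assumption: a simple stand-in, not necessarily the method AZUREAL uses.)
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).floor() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            // Clamp the right-hand neighbour at the end of the buffer.
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}

/// Insert `text` at byte offset `cursor`, prepending a space unless the cursor
/// is at the start of a line or right after a space. Returns the new cursor.
fn insert_with_smart_spacing(buffer: &mut String, cursor: usize, text: &str) -> usize {
    let needs_space = match buffer[..cursor].chars().next_back() {
        None | Some('\n') | Some(' ') => false,
        Some(_) => true,
    };
    let inserted = if needs_space {
        format!(" {}", text)
    } else {
        text.to_string()
    };
    buffer.insert_str(cursor, &inserted);
    cursor + inserted.len()
}

fn main() {
    // One second of silent stereo audio at 48 kHz (hypothetical device format).
    let stereo = vec![0.0_f32; 48_000 * 2];
    let mono = mix_to_mono(&stereo, 2);
    let ready = resample_linear(&mono, 48_000, 16_000);
    println!("{} samples at 16 kHz", ready.len());

    let mut prompt = String::from("Fix the bug");
    let cursor = prompt.len();
    insert_with_smart_spacing(&mut prompt, cursor, "in the parser");
    println!("{}", prompt);
}
```

A production resampler would normally apply a low-pass filter before downsampling to avoid aliasing; the sketch omits that step to keep the data flow visible.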
Toggle Recording
Press Ctrl+S while in prompt mode or edit mode to start recording. Press
Ctrl+S again to stop recording and trigger transcription.
On Windows and Linux, the edit mode binding is Alt+S instead of
Ctrl+S (because Ctrl+S is used for file save in edit mode on those
platforms). The prompt mode binding remains Ctrl+S on all platforms.
The stop keybinding resolves from any focus state or mode while recording is
active. You do not need to navigate back to the prompt to stop – Ctrl+S will
always stop an active recording regardless of where focus currently sits.
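The binding rules above can be summarized as a small resolution function. This is only a sketch of the described behavior: the `Mode`, `Key`, and `Action` types and the `resolve` function are hypothetical names, not AZUREAL's actual keymap code, and the platform is passed in as a flag so the logic is easy to exercise.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Prompt,
    Edit,
    Other, // any other focus state
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Key {
    CtrlS,
    AltS,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    StartRecording,
    StopRecording,
    Ignored, // the key means something else (or nothing) in this context
}

/// Resolve a key press into a speech action (hypothetical helper).
fn resolve(key: Key, mode: Mode, recording: bool, is_macos: bool) -> Action {
    // While recording, Ctrl+S stops from any focus state or mode.
    if recording && key == Key::CtrlS {
        return Action::StopRecording;
    }
    // Otherwise the toggle binding depends on mode and platform.
    let is_toggle = match (mode, key) {
        (Mode::Prompt, Key::CtrlS) => true,
        (Mode::Edit, Key::CtrlS) => is_macos, // file save on Windows/Linux
        (Mode::Edit, Key::AltS) => !is_macos,
        _ => false,
    };
    match (is_toggle, recording) {
        (true, false) => Action::StartRecording,
        (true, true) => Action::StopRecording,
        (false, _) => Action::Ignored,
    }
}

fn main() {
    // Windows/Linux: Alt+S starts recording in edit mode...
    println!("{:?}", resolve(Key::AltS, Mode::Edit, false, false));
    // ...and Ctrl+S stops it even when focus has moved elsewhere.
    println!("{:?}", resolve(Key::CtrlS, Mode::Other, true, false));
}
```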
Visual Feedback
While recording is active, two visual indicators appear:
- Magenta border – The prompt or edit buffer border turns magenta to signal that the microphone is live.
- REC / ... prefix – The status area shows REC while audio is being captured and ... while transcription is in progress.
A progress indicator also appears in the status bar during the transcription phase, since Whisper processing takes a moment depending on the length of the recording.
Resource Efficiency
The speech subsystem is designed to consume zero resources when not in use:
- Background thread – The audio processing thread blocks on mpsc::recv() when idle. It consumes no CPU until a recording is started.
- Lazy model loading – The WhisperContext is not created at startup. It is loaded on the first use of speech-to-text, so users who never dictate pay no memory cost.
Once loaded, the Whisper context remains in memory for the duration of the session to avoid repeated model load times.
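The idle-blocking and lazy-loading pattern can be sketched with the standard library alone. Everything here is illustrative: `Model` is a hypothetical stand-in for the real WhisperContext, and `Command` and `spawn_worker` are assumed names rather than AZUREAL's actual types. The worker returns how many times the model was loaded so the "load once" behavior is observable.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the real WhisperContext.
struct Model;

impl Model {
    fn load() -> Self {
        println!("loading model (happens once)");
        Model
    }

    fn transcribe(&self, samples: &[f32]) -> String {
        format!("<{} samples transcribed>", samples.len())
    }
}

#[derive(Debug)]
enum Command {
    Transcribe(Vec<f32>),
    Shutdown,
}

/// Spawn the worker. The join handle yields the number of model loads.
fn spawn_worker() -> (
    mpsc::Sender<Command>,
    mpsc::Receiver<String>,
    thread::JoinHandle<u32>,
) {
    let (cmd_tx, cmd_rx) = mpsc::channel::<Command>();
    let (text_tx, text_rx) = mpsc::channel::<String>();
    let handle = thread::spawn(move || {
        let mut model: Option<Model> = None; // nothing loaded at startup
        let mut loads = 0;
        // recv() blocks the thread until a command arrives, so an idle
        // worker does not spin.
        while let Ok(cmd) = cmd_rx.recv() {
            match cmd {
                Command::Transcribe(samples) => {
                    // Load on first use, then reuse for the rest of the session.
                    let m = model.get_or_insert_with(|| {
                        loads += 1;
                        Model::load()
                    });
                    let _ = text_tx.send(m.transcribe(&samples));
                }
                Command::Shutdown => break,
            }
        }
        loads
    });
    (cmd_tx, text_rx, handle)
}

fn main() {
    let (cmd_tx, text_rx, handle) = spawn_worker();
    cmd_tx.send(Command::Transcribe(vec![0.0; 16_000])).unwrap();
    println!("{}", text_rx.recv().unwrap());
    cmd_tx.send(Command::Transcribe(vec![0.0; 32_000])).unwrap();
    println!("{}", text_rx.recv().unwrap());
    cmd_tx.send(Command::Shutdown).unwrap();
    println!("model loads: {}", handle.join().unwrap());
}
```

Keeping the model in an `Option` owned by the worker thread means no locking is needed: only that thread ever touches it.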
Whisper Model
AZUREAL uses the ggml-small.en Whisper model, stored at:
~/.azureal/speech/ggml-small.en.bin
This file is approximately 466 MB. If the model file is missing, AZUREAL shows an error with download instructions. You must download the model manually before first use:
mkdir -p ~/.azureal/speech && curl -L -o ~/.azureal/speech/ggml-small.en.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en.bin
The small.en model provides a good balance between transcription accuracy and
speed for English-language input.
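A missing-model check along these lines could look like the following sketch. The helper names (`model_path`, `locate_model`) and the exact error wording are assumptions for illustration; only the file location and download URL come from this page. The home directory is passed in explicitly so the lookup can be tested without touching the real environment.

```rust
use std::path::{Path, PathBuf};

const MODEL_URL: &str =
    "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en.bin";

/// Build the expected model location under the given home directory.
fn model_path(home: &Path) -> PathBuf {
    home.join(".azureal").join("speech").join("ggml-small.en.bin")
}

/// Return the model path if the file exists, otherwise an error message
/// with download instructions (hypothetical wording).
fn locate_model(home: &Path) -> Result<PathBuf, String> {
    let path = model_path(home);
    if path.is_file() {
        return Ok(path);
    }
    let dir = path.parent().unwrap_or(home);
    Err(format!(
        "Whisper model not found at {}\nDownload it with:\n  mkdir -p {} && curl -L -o {} {}",
        path.display(),
        dir.display(),
        path.display(),
        MODEL_URL
    ))
}

fn main() {
    // HOME on Unix-like systems, USERPROFILE on Windows.
    let home = std::env::var_os("HOME")
        .or_else(|| std::env::var_os("USERPROFILE"))
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("."));
    match locate_model(&home) {
        Ok(path) => println!("model found: {}", path.display()),
        Err(msg) => eprintln!("{}", msg),
    }
}
```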
Quick Reference
Ctrl+S Toggle recording on/off (prompt mode, or edit mode on macOS)
Alt+S Toggle recording on/off (edit mode on Windows/Linux)
Ctrl+S Stop recording from ANY focus/mode (always resolves)
| Detail | Value |
|---|---|
| Audio library | cpal (CoreAudio on macOS, WASAPI on Windows) |
| Transcription engine | whisper.cpp (Metal GPU on macOS, CUDA GPU on Windows) |
| Sample pipeline | f32 -> mono mixdown -> 16kHz resample |
| Decoding strategy | Greedy { best_of: 1 } |
| Model file | ~/.azureal/speech/ggml-small.en.bin (~466 MB) |
| Idle CPU usage | Zero (thread blocks on mpsc::recv) |