APRIL 1, 2026

Local Speech-to-Text at 53x Real-Time on Consumer Hardware

BY LALO

Most offline transcription apps start the same way. Run Whisper locally, maybe vibe code a UI around it, call it done. The problem is that Whisper on consumer hardware tops out around 5-15x real-time, and nobody optimizes beyond the model. The paste layer, the audio pipeline, the execution provider routing all get left on the table.

Clarity Scribe uses NVIDIA’s Parakeet TDT 0.6B for English. For other languages, it falls back to Whisper. A short dictation, “Hello, how are you?”, transcribes in 67ms on an M3 MacBook. A longer sentence runs about 143ms. A full 60-second recording completes in 1.3 seconds. Those numbers include the entire pipeline: mel spectrogram, neural inference, clipboard write, window focus, and paste into your active app.

Tested on both Apple Silicon (M3) and an RTX 3090. Electron, React, ONNX Runtime. No cloud. I collaborated with Claude Opus on this build, optimizing as much as possible for consumer hardware.

GPU Was Making the Decoder Slower

Parakeet TDT is a transducer with three parts: an encoder that processes the full mel spectrogram once, a decoder (LSTM prediction network) that runs once per emitted token, and a joiner that combines their outputs once per encoder frame. The decoder and joiner each run hundreds of times per transcription.

Profiling each component separately on different execution providers revealed something unexpected. The encoder running fastest on DirectML was expected: large matrix operations benefit from GPU parallelism. The surprise was that moving the decoder and joiner to CPU cut end-to-end time for 60 seconds of audio by roughly 43% compared with running everything on DirectML.

Config	60s Audio
DML (everything)	2,283ms
Hybrid (DML encoder, CPU decoder)	1,313ms
CPU (everything)	4,731ms
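Per-configuration timings like these can be collected with a small harness. The following is a minimal sketch, not the project's actual benchmark code: `timeRuns` and `median` are illustrative names, and the commented usage assumes the session objects and input feeds already exist.

```javascript
// Middle element of the sorted samples (upper middle for even counts).
function median(samples) {
    const sorted = [...samples].sort((a, b) => a - b);
    return sorted[Math.floor(sorted.length / 2)];
}

// Runs an async function several times and returns the median
// wall-clock time in milliseconds.
async function timeRuns(fn, runs = 5) {
    await fn(); // warm-up run, not timed
    const samples = [];
    for (let i = 0; i < runs; i++) {
        const start = performance.now();
        await fn();
        samples.push(performance.now() - start);
    }
    return median(samples);
}

// Hypothetical usage: wrap each session's run() call in a closure.
// const encoderMs = await timeRuns(() => encoderSession.run(encoderFeeds));
// const decoderMs = await timeRuns(() => decoderSession.run(decoderFeeds));
```

Taking the median rather than the mean keeps one slow outlier run from skewing the comparison between providers.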

The decoder and joiner run in a tight sequential loop. Each DirectML inference call has fixed kernel launch overhead (~1-3ms) regardless of computation size. For a single LSTM timestep, the actual compute is microseconds. The overhead dominates. CPU eliminates it.

// encoder gets GPU
const encoderSession = await ort.InferenceSession.create(encoderPath, {
    executionProviders: ['dml', 'cpu']
});

// decoder/joiner get CPU
const decoderSession = await ort.InferenceSession.create(decoderPath, {
    executionProviders: ['cpu']
});
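The overhead explanation can be sanity-checked with back-of-envelope arithmetic on the 60s figures. This sketch assumes, as a simplification, that the entire gap between the all-DML and hybrid runs is per-call dispatch overhead; it ignores the CPU's own compute time for those calls, so treat the result as a rough bound rather than a measurement.

```javascript
// Figures from the 60s benchmark table.
const allDmlMs = 2283;
const hybridMs = 1313;
const savedMs = allDmlMs - hybridMs; // 970 ms

// Assumption: every saved millisecond was GPU dispatch overhead.
function impliedCalls(totalSavedMs, overheadPerCallMs) {
    return Math.round(totalSavedMs / overheadPerCallMs);
}

console.log(impliedCalls(savedMs, 3)); // 323
console.log(impliedCalls(savedMs, 1)); // 970
```

At 1-3ms per call, the 970ms saving corresponds to somewhere between roughly 320 and 970 decoder and joiner calls, which lines up with those components running hundreds of times per transcription.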

CoreML Crashes on Long Audio (macOS)

The same hybrid strategy applies on Apple Silicon: CoreML for the encoder, CPU for the decoder. Short recordings worked. An 84-second recording killed the process. EXC_BREAKPOINT (SIGTRAP) on com.apple.CoreMLNNProcessingQueue. No catchable error. Process just dies.

Chunking the audio wasn’t viable. Parakeet TDT isn’t designed for it. Unlike Whisper, there’s no context prompting or overlap deduplication between chunks.

The fix was duration-based provider routing: CoreML for short audio where it’s fastest, CPU for long audio where CoreML crashes.

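// Pick execution providers by clip length (function name is illustrative)
function pickProviders(audioDurationSeconds) {
    // CoreML crashes on long audio, so those clips go to CPU
    if (process.platform === 'darwin' && audioDurationSeconds > 60) {
        return ['cpu'];
    }
    return ['coreml', 'cpu'];
}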

Long Audio Silently Truncated

With hybrid routing working, I tested longer audio. A 63-second recording consistently cut off around the 40-second mark. Same cutoff point every time.

We traced it to the TDT decoder’s frame advancement logic. At each step, the joiner outputs both a token prediction and a duration prediction (how many frames to skip). The sherpa-onnx reference implementation uses three independent if-blocks to handle duration skips, max-tokens-per-frame limits, and blank-with-no-skip fallbacks. Our code used an else-if chain, which meant the blank-fallback check never ran when the model predicted a skip. The decoder jumped over spoken frames.

// Bug: else-if prevented multiple conditions from firing
if (skip > 0) { tokensThisFrame = 0; }
else if (y === BLANK_ID || tokensThisFrame >= maxTokensPerFrame) { skip = 1; }

// Fix: three separate if-blocks (matches sherpa-onnx DecodeOneTDT)
if (skip > 0) { tokensThisFrame = 0; }
if (tokensThisFrame >= maxTokensPerFrame) { skip = 1; }
if (y === BLANK_ID && skip === 0) { skip = 1; }

A single else versus three ifs. That was the difference between a transcript cut off around the 40-second mark and a complete one.
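To see where those three checks sit in the decode loop, here is a minimal sketch of greedy TDT frame advancement, loosely modelled on the three-if structure described above. It is not the app's decoder: `joinerStep` is a hypothetical stub standing in for the real joiner and decoder sessions, and a real implementation would also update the LSTM state after each non-blank token.

```javascript
const BLANK_ID = 0;

// joinerStep(t) is a stand-in that returns the predicted token `y`
// and the predicted `duration` (frames to skip) at encoder frame t.
function greedyTdtDecode(numFrames, joinerStep, maxTokensPerFrame = 5) {
    const tokens = [];
    const visited = [];
    let tokensThisFrame = 0;
    let skip = 0;

    for (let t = 0; t < numFrames; t += skip) {
        visited.push(t);
        const { y, duration } = joinerStep(t);
        skip = duration;

        if (y !== BLANK_ID) {
            tokens.push({ token: y, frame: t });
            tokensThisFrame += 1;
        }

        // Three independent checks, evaluated in order.
        if (skip > 0) { tokensThisFrame = 0; }
        if (tokensThisFrame >= maxTokensPerFrame) { tokensThisFrame = 0; skip = 1; }
        if (y === BLANK_ID && skip === 0) { tokensThisFrame = 0; skip = 1; }
    }
    return { tokens, visited };
}

// Toy stub: a token with duration 2 on every third frame, blank otherwise.
const stub = (t) => (t % 3 === 0
    ? { y: t + 1, duration: 2 }
    : { y: BLANK_ID, duration: 0 });

const result = greedyTdtDecode(10, stub);
console.log(result.tokens.map((x) => x.frame)); // [ 0, 3, 6, 9 ]
console.log(result.visited);                    // [ 0, 2, 3, 5, 6, 8, 9 ]
```

The second and third checks guarantee forward progress: a frame that keeps emitting tokens, or a blank with a zero duration, still advances by one frame instead of looping forever.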

Paste Took Longer Than Transcription

After transcription, the app writes to the clipboard and sends Ctrl+V to the target window. The paste step felt sluggish, so we profiled it.

Root cause: two PowerShell process spawns.

powershell → load .NET CLR → JIT compile C# → P/Invoke SetForegroundWindow → exit
powershell → load .NET CLR → JIT compile C# → SendKeys("^v") → exit

Each spawn: ~500ms. Plus delays for focus timing and clipboard restore. Total: ~1,450ms of paste overhead on top of every transcription. The actual Win32 calls, SetForegroundWindow and SendInput, take about 2ms combined.

We replaced both spawns with koffi, a native FFI library that calls user32.dll directly from Node.js. After two failed attempts using EnumWindows with callback lifetime issues, the solution turned out to be simpler: when the user presses the hotkey, the target app is the foreground window. Just call GetForegroundWindow() at that moment and store the handle.

Result: 2-3ms focus + paste.
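The capture-then-restore flow can be sketched independently of the FFI layer. In this illustrative version the three native functions are injected as plain callbacks, so the logic runs anywhere; in the app they would be user32.dll functions bound through koffi. `createPasteTarget` and `sendPasteKeys` are hypothetical names, not taken from the project.

```javascript
// Remember the foreground window at hotkey time, restore it at paste time.
function createPasteTarget({ getForegroundWindow, setForegroundWindow, sendPasteKeys }) {
    let targetHandle = null;
    return {
        // Call from the hotkey handler, while the target app still has focus.
        capture() {
            targetHandle = getForegroundWindow();
            return targetHandle;
        },
        // Call once the transcript is on the clipboard.
        paste() {
            if (targetHandle === null) return false; // nothing captured
            setForegroundWindow(targetHandle);
            sendPasteKeys(); // e.g. Ctrl+V
            targetHandle = null;
            return true;
        },
    };
}

// Exercise the flow with stubbed native calls.
const calls = [];
const target = createPasteTarget({
    getForegroundWindow: () => 0x1234,
    setForegroundWindow: (hwnd) => calls.push(`focus:${hwnd}`),
    sendPasteKeys: () => calls.push('paste'),
});
target.capture();
target.paste();
console.log(calls); // [ 'focus:4660', 'paste' ]
```

Storing the handle at hotkey time avoids any window enumeration, and with it the callback lifetime problems mentioned above.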

Hold-to-Talk

The original interaction was tap-to-toggle: press a hotkey to start recording, press again to stop. But dictation feels more natural as push-to-talk. Hold a key, speak, release, text appears.

Building this cross-platform required global key release detection, which Electron’s built-in globalShortcut doesn’t support. It only fires on key-down. We used uiohook-napi, which provides keydown/keyup events via SetWindowsHookEx on Windows and CGEventTap on macOS. Both capture key releases even when the app is in the background.

Toggle mode still uses Electron’s globalShortcut. Hold mode uses uiohook. A single hotkeyService manages both and can switch at runtime without restart.
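One way to structure such a service is a small state machine that both backends feed into. The sketch below is an assumption about the shape of that logic, not the project's hotkeyService: the key events are passed in by the caller, so the globalShortcut callback would call `handleKeyDown` in toggle mode and the uiohook keydown/keyup listeners would call both handlers in hold mode.

```javascript
function createHotkeyController({ onStart, onStop, mode = 'toggle' }) {
    let recording = false;
    let keyHeld = false;

    const start = () => { if (!recording) { recording = true; onStart(); } };
    const stop = () => { if (recording) { recording = false; onStop(); } };

    return {
        // Switch modes at runtime; any in-progress recording is stopped first.
        setMode(next) {
            stop();
            keyHeld = false;
            mode = next;
        },
        handleKeyDown() {
            if (mode === 'toggle') {
                if (recording) { stop(); } else { start(); }
                return;
            }
            if (keyHeld) return; // ignore OS key-repeat while the key is held
            keyHeld = true;
            start();
        },
        handleKeyUp() {
            if (mode !== 'hold') return; // toggle mode ignores releases
            keyHeld = false;
            stop();
        },
        isRecording: () => recording,
    };
}

// Hold mode: repeated keydown events while held produce a single start.
const events = [];
const hotkeys = createHotkeyController({
    onStart: () => events.push('start'),
    onStop: () => events.push('stop'),
    mode: 'hold',
});
hotkeys.handleKeyDown();
hotkeys.handleKeyDown(); // key-repeat
hotkeys.handleKeyUp();
console.log(events); // [ 'start', 'stop' ]
```

Stopping the recording inside `setMode` is a design choice made for this sketch; it keeps a mode switch from leaving the recorder running with no key event left to end it.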

Filler Word Removal

Transcription models faithfully reproduce filler words: um, uh, ah, er. A regex post-processing pass strips these from output while preserving discourse markers like “you know” and “I mean” that carry conversational meaning. Hedge words (basically, actually, literally) are also left intact.
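A post-processing pass of this kind can be quite small. The following is a minimal sketch rather than the app's actual regex: the word list is taken from the paragraph above, and the function name and cleanup rules (dropping a trailing comma, collapsing whitespace, re-capitalizing the first letter) are assumptions. It deliberately ignores edge cases such as a filler at the very end of a sentence or hyphenated forms like "uh-huh".

```javascript
const FILLERS = ['um', 'uh', 'ah', 'er'];

// Whole-word match, plus an optional trailing comma and whitespace.
const FILLER_RE = new RegExp(`\\b(?:${FILLERS.join('|')})\\b,?\\s*`, 'gi');

function stripFillers(text) {
    const stripped = text
        .replace(FILLER_RE, '')
        .replace(/\s{2,}/g, ' ')
        .trim();
    // Restore sentence case if a leading filler was removed.
    return stripped.charAt(0).toUpperCase() + stripped.slice(1);
}

console.log(stripFillers('Um, I think, uh, you know, it basically works.'));
// I think, you know, it basically works.
console.log(stripFillers('so uh we ship it'));
// So we ship it
```

Because the pattern only matches the listed words at word boundaries, "you know" and "basically" pass through untouched.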

Where It Landed

Stage	Before	After
Transcription (78s audio)	4,731ms, CPU-only (13.5x RT)	1,460ms, hybrid DML+CPU (53.2x RT)
Transcription (23s audio)	1,268ms, CPU-only (17.2x RT)	854ms, hybrid DML+CPU (26.6x RT)
Paste to target app	~1,450ms, PowerShell spawns	2-3ms, native FFI
Total (78s recording)	~6,481ms	~1,463ms

4.4x reduction in total user-facing latency. The transcription engine got 3.6x faster. The paste step got 500x faster.

Solution	Speed	Hardware	On-Device?
Parakeet TDT (NVIDIA benchmark)	3,386x RTF	A100, batch=128	Server
Groq Whisper Large v3	299x RTF	Proprietary LPU	Cloud API
Clarity Scribe	53.2x RTF	RTX 3090 (consumer)	Fully offline
Clarity Scribe	~40x RTF	M3 MacBook (Apple Silicon)	Fully offline
Faster-whisper (CTranslate2)	~10-20x RTF	Consumer GPU	Yes
Whisper.cpp (Metal/CUDA)	~5-15x RTF	Consumer GPU	Yes

NVIDIA’s 3,386x is batch=128 on a $10,000 GPU. Groq’s 299x doesn’t include network round-trip. Running at 53x on a consumer RTX 3090 and ~40x on an M3 MacBook, fully offline, with results pasted into your active app in about 3ms, is a different category from server benchmarks. For reference, raw Whisper on CPU sits around 5-15x real-time on the same hardware.

This was a collaboration between me and Claude Opus from start to finish. The pattern that worked was iterating fast: benchmark, find the bottleneck, fix it, benchmark again.

The app is open source: github.com/laloquidity/clarity-scribe