Most offline transcription apps start the same way: run Whisper locally, maybe vibe code a UI around it, call it done. The problem is that Whisper on consumer hardware tops out around 5-15x real-time, and nobody optimizes beyond the model. The paste layer, the audio pipeline, and the execution provider routing all get left on the table.

Clarity Scribe uses NVIDIA’s Parakeet TDT 0.6B for English. For other languages, it falls back to Whisper. A short dictation, “Hello, how are you?”, transcribes in 67ms on an M3 MacBook. A longer sentence runs about 143ms. A full 60-second recording completes in 1.3 seconds. Those numbers include the entire pipeline: mel spectrogram, neural inference, clipboard write, window focus, and paste into your active app.

Tested on both Apple Silicon (M3) and an RTX 3090. Built with Electron, React, and ONNX Runtime. No cloud. I collaborated with Claude Opus on this build, optimizing as much as possible for consumer hardware.


GPU Was Making the Decoder Slower

Parakeet TDT is a transducer with three parts: an encoder that processes the full mel spectrogram once, a decoder (LSTM prediction network) that runs once per emitted token, and a joiner that combines their outputs once per encoder frame. The decoder and joiner each run hundreds of times per transcription.

Profiling each component separately on different execution providers revealed something unexpected. The encoder being faster on DirectML was expected: large matrix operations benefit from GPU parallelism. The surprise was that moving the decoder and joiner to CPU made the whole pipeline 43% faster than running everything on DirectML.

| Config                            | 60s Audio |
|-----------------------------------|-----------|
| DML (everything)                  | 2,283ms   |
| Hybrid (DML encoder, CPU decoder) | 1,313ms   |
| CPU (everything)                  | 4,731ms   |

The decoder and joiner run in a tight sequential loop. Each DirectML inference call has fixed kernel launch overhead (~1-3ms) regardless of computation size. For a single LSTM timestep, the actual compute is microseconds. The overhead dominates. CPU eliminates it.
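A back-of-envelope sketch makes the imbalance concrete. The step count and per-step timings here are illustrative assumptions, not measurements, using the ~2ms launch overhead cited above and a hypothetical 500 decode steps for 60 seconds of speech:

```javascript
// Why per-call GPU overhead dominates a tight sequential decode loop.
// All numbers are illustrative assumptions, not profiled values.
const decoderSteps = 500;          // hypothetical token/step count for 60s of speech
const gpuLaunchOverheadMs = 2;     // fixed DirectML kernel-launch cost per call
const lstmComputeMs = 0.05;        // actual compute for one LSTM timestep
const cpuCallOverheadMs = 0.01;    // near-zero dispatch cost on CPU

const gpuTotal = decoderSteps * (gpuLaunchOverheadMs + lstmComputeMs);
const cpuTotal = decoderSteps * (cpuCallOverheadMs + lstmComputeMs);

console.log(`decode loop on GPU: ~${gpuTotal}ms, on CPU: ~${cpuTotal}ms`);
```

Even with generous assumptions, the fixed launch overhead is two orders of magnitude larger than the work being launched, which is why the loop belongs on CPU.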

// encoder gets GPU
const encoderSession = await ort.InferenceSession.create(encoderPath, {
    executionProviders: ['dml', 'cpu']
});

// decoder/joiner get CPU
const decoderSession = await ort.InferenceSession.create(decoderPath, {
    executionProviders: ['cpu']
});

Long Audio Silently Truncated

With hybrid routing working, I tested longer audio. A 63-second recording consistently cut off around the 40-second mark. Same cutoff point every time.

We traced it to the TDT decoder’s duration head. At each step, the joiner outputs both a token prediction and a duration prediction. The duration tells the decoder how many encoder frames to skip forward:

let skip = 1;  // frames to advance; updated by the duration head each step
for (let t = 0; t < encoderOutLen; t += skip) {
    // ... run decoder + joiner for frame t
    skip = argmax(durationLogits);
}

On short audio, skip values stay in a reasonable range: 1 to 4 frames. But around the 35-40 second mark, the model hit a high-confidence silence region and emitted skip values of 8, 10, sometimes higher. These compound: three skips of 10 frames in a row jump past 300ms of audio. Over a long stretch, enough frames get skipped that the decoder reaches encoderOutLen early. The fix:

if (skip > 6) skip = 6;  // clamp the duration head's frame-skip prediction

Six frames (~60ms at 10ms/frame) covers any legitimate speech pause. After this change, 60-second recordings transcribe fully. The bug only appears past ~30 seconds because the model hasn’t accumulated enough skip errors before that.
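A minimal simulation shows the clamp at work. The `decodeFrames` helper and the `predictSkip` callback are invented for illustration; the callback fakes a high-confidence silence region that emits aggressive skip predictions:

```javascript
const MAX_SKIP = 6; // ~60ms at 10ms per encoder frame

// Sketch of the TDT frame-skipping loop with the clamp applied.
function decodeFrames(encoderOutLen, predictSkip) {
  const visited = [];
  let skip = 1;
  for (let t = 0; t < encoderOutLen; t += skip) {
    visited.push(t); // real code runs decoder + joiner here
    skip = Math.min(predictSkip(t), MAX_SKIP); // the one-line fix
  }
  return visited;
}

// Fake duration head: predicts skips of 10 frames in a "silence" region.
const aggressive = (t) => (t > 50 ? 10 : 2);
const frames = decodeFrames(100, aggressive);
console.log(frames); // no gap between visited frames ever exceeds MAX_SKIP
```

Without the `Math.min`, the same callback would advance 10 frames at a time through the silence region, and the compounding skips are what ate the tail of long recordings.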


Paste Took Longer Than Transcription

After transcription, the app writes to the clipboard and sends Ctrl+V to the target window. The paste step felt sluggish, so we profiled it.

Root cause: two PowerShell process spawns.

powershell → load .NET CLR → JIT compile C# → P/Invoke SetForegroundWindow → exit
powershell → load .NET CLR → JIT compile C# → SendKeys("^v") → exit

Each spawn: ~500ms. Plus delays for focus timing and clipboard restore. Total: ~1,450ms of paste overhead on top of every transcription. The actual Win32 calls, SetForegroundWindow and SendInput, take about 2ms combined.

We replaced both spawns with koffi, a native FFI library that calls user32.dll directly from Node.js. After two failed attempts using EnumWindows with callback lifetime issues, the solution turned out to be simpler: when the user presses the hotkey, the target app is the foreground window. Just call GetForegroundWindow() at that moment and store the handle.
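A sketch of that capture-and-replay flow with koffi. This is Windows-only and hedged: the C-prototype strings are my assumption of koffi's declaration API, and `keybd_event` stands in for the `SendInput` call the app actually uses, because it avoids struct definitions while illustrating the same zero-spawn idea:

```javascript
// Windows-only sketch; assumes koffi's C-prototype func() API.
const koffi = require('koffi');
const user32 = koffi.load('user32.dll');

const GetForegroundWindow = user32.func('void* GetForegroundWindow()');
const SetForegroundWindow = user32.func('bool SetForegroundWindow(void* hWnd)');
const keybd_event = user32.func(
  'void keybd_event(uint8_t vk, uint8_t scan, uint32_t flags, uintptr_t extra)'
);

let targetWindow = null;

// On hotkey press: the app the user is dictating into IS the foreground window.
function captureTarget() {
  targetWindow = GetForegroundWindow();
}

// After transcription: refocus and replay Ctrl+V. No process spawn, no CLR load.
function pasteIntoTarget() {
  if (!targetWindow) return;
  SetForegroundWindow(targetWindow);
  const KEYEVENTF_KEYUP = 0x0002;
  keybd_event(0x11, 0, 0, 0);               // Ctrl down (VK_CONTROL)
  keybd_event(0x56, 0, 0, 0);               // 'V' down
  keybd_event(0x56, 0, KEYEVENTF_KEYUP, 0); // 'V' up
  keybd_event(0x11, 0, KEYEVENTF_KEYUP, 0); // Ctrl up
}
```

The key design point survives the simplification: capturing the handle at hotkey time sidesteps the window-enumeration callbacks entirely.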

Result: 11ms focus + paste. On an M3 MacBook, a 1.7-second clip comes back in 67ms total; a 5.4-second clip in 143ms. The paste is no longer the bottleneck.


CoreML Crashes on Long Audio (macOS)

The same hybrid strategy applies on Apple Silicon: CoreML for the encoder, CPU for the decoder. Short recordings worked. An 84-second recording killed the process. EXC_BREAKPOINT (SIGTRAP) on com.apple.CoreMLNNProcessingQueue. No catchable error. Process just dies.

Chunking the audio wasn’t viable. Parakeet TDT isn’t designed for it. Unlike Whisper, there’s no context prompting or overlap deduplication between chunks.

The fix was duration-based provider routing: CoreML for short audio where it’s fastest, CPU for long audio where CoreML crashes.

if (process.platform === 'darwin' && audioDurationSeconds > 60) {
    return ['cpu'];
}
return ['coreml', 'cpu'];
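Generalizing that snippet into a helper makes the routing easy to test. The function name and the non-macOS branch are my additions; the 60-second cutoff is the empirical threshold from the crash described above:

```javascript
// Duration-based execution provider routing (sketch).
function pickExecutionProviders(platform, audioDurationSeconds) {
  if (platform === 'darwin') {
    // CoreML crashes (SIGTRAP) on long audio; route it to CPU instead.
    return audioDurationSeconds > 60 ? ['cpu'] : ['coreml', 'cpu'];
  }
  // Assumed Windows branch: DirectML first, CPU fallback.
  return ['dml', 'cpu'];
}

console.log(pickExecutionProviders('darwin', 84)); // long audio avoids CoreML
```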

Where It Landed

| Stage                     | Before                        | After                              |
|---------------------------|-------------------------------|------------------------------------|
| Transcription (60s audio) | 4,731ms, CPU-only (13.5x RT)  | 1,313ms, hybrid DML+CPU (46.2x RT) |
| Transcription (23s audio) | 1,268ms, CPU-only (17.2x RT)  | 854ms, hybrid DML+CPU (26.6x RT)   |
| Paste to target app       | ~1,450ms, PowerShell spawns   | 11ms, native FFI                   |
| Total (60s recording)     | ~6,481ms                      | ~1,324ms                           |

Nearly a 5x reduction in total user-facing latency (~6,481ms down to ~1,324ms). The transcription engine got 3.6x faster. The paste step got 130x faster.

| Solution                        | Speed      | Hardware             | On-Device?     |
|---------------------------------|------------|----------------------|----------------|
| Parakeet TDT (NVIDIA benchmark) | 3,386x RTF | A100, batch=128      | Server         |
| Groq Whisper Large v3           | 299x RTF   | Proprietary LPU      | Cloud API      |
| Clarity Scribe                  | 46.2x RTF  | RTX 3090 (consumer)  | Fully offline  |
| Faster-whisper (CTranslate2)    | ~10-20x RTF| Consumer GPU         | Yes            |
| Whisper.cpp (Metal/CUDA)        | ~5-15x RTF | Consumer GPU         | Yes            |

NVIDIA’s 3,386x is batch=128 on a $10,000 GPU. Groq’s 299x doesn’t include network round-trip. 46x on a consumer RTX 3090, fully offline, with results pasted into your active app in 11ms is a different category than server benchmarks.


This was a collaboration between me and Claude Opus from start to finish. The pattern that worked was iterating fast: benchmark, find the bottleneck, fix it, benchmark again.

The app is open source: github.com/laloquidity/clarity-scribe