I Cloned Ray Dalio's Voice to Read His Own Viral Article
Ray Dalio posted a long-form article on X. It hit over 50 million views in the first day. Everyone was talking about it.
I didn’t have time to sit down and read it. But I still wanted the substance, the kind of thing you’d want to absorb on a run or during a commute, not stare at on a screen. So I thought: what if I could just listen to it? And then a more interesting idea: what if I could clone Ray Dalio’s voice and have him read his own article to me?
So I built that.
Bookmarked @RayDalio's latest but never got around to reading it? I built a TTS pipeline to turn it into a ~45 min narrated video. Listen to Ray read it to you on your commute. https://t.co/FT4s4Y0BU4
— Lalo (@Laloquidity), February 18, 2026
Finding his voice
Chatterbox, the voice cloning model from Resemble AI, needs about 10 to 30 seconds of clean speech to learn a voice. I scrubbed through an interview with Dalio, listening for a segment where he spoke uninterrupted, without background noise or crosstalk, with a consistent cadence the model could learn from.
I pulled a clip from 2:28 to 2:52. It had 13% silence, which is relatively low. I’d tried a different clip earlier with 22% silence and the output sounded noticeably worse. Same model, same settings, same text. The reference audio quality turned out to matter more than any model parameter.
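That silence percentage can be measured with a simple frame-level RMS gate. Here is a minimal sketch of the idea; the function name, frame length, and threshold are my own illustrative choices, not values from the actual pipeline:

```python
import numpy as np

def silence_fraction(samples: np.ndarray, frame_len: int = 480,
                     threshold: float = 0.01) -> float:
    """Fraction of frames whose RMS falls below `threshold`.

    `samples` is mono float audio in [-1, 1]; a frame of 480
    samples is 10 ms at 48 kHz. Both defaults are assumptions.
    """
    n = len(samples) - len(samples) % frame_len
    frames = samples[:n].reshape(-1, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float((rms < threshold).mean())

# Sanity check: 1 s of tone followed by 1 s of silence -> 50% silent.
sr = 48_000
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
clip = np.concatenate([tone, np.zeros(sr)])
print(round(silence_fraction(clip), 2))  # 0.5
```

A gate like this is crude (a quiet but voiced passage can read as "silence"), but it is enough to compare candidate reference clips against each other.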
Making it sound natural
The first version worked but had problems. TTS models choke on certain text patterns. Periods inside abbreviations like “e.g.” and “U.S.” create unnatural pauses. Numbered lists break the speech rhythm. ALL-CAPS headers get shouted.
The pipeline includes a preprocessor that handles 47 of these fixes before the text reaches the model. “E.g.” becomes “for example.” “U.S.” becomes “US.” “1.” at the start of a line becomes “First:”. ALL-CAPS titles convert to title case with added pauses.
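Fixes like these are naturally expressed as an ordered list of pattern rewrites. A sketch of a few of them, reconstructed from the examples above (the real preprocessor has 47 rules; these four are my approximations, not the actual code):

```python
import re

# Ordered (pattern, replacement) pairs, applied before a line-level
# pass that de-shouts ALL-CAPS headers.
RULES = [
    (re.compile(r"\be\.g\.", re.IGNORECASE), "for example"),
    (re.compile(r"\bU\.S\."), "US"),
    (re.compile(r"^1\.\s*", re.MULTILINE), "First: "),
]

def prepare_for_tts(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    lines = []
    for line in text.splitlines():
        # ALL-CAPS lines are likely headers: title-case them and add a
        # trailing period so the model pauses instead of shouting.
        if line.isupper() and len(line) > 3:
            line = line.title() + "."
        lines.append(line)
    return "\n".join(lines)

print(prepare_for_tts("THE BIG CYCLE\n1. Debt rises, e.g. in the U.S."))
# The Big Cycle.
# First: Debt rises, for example in the US
```

Order matters here: the abbreviation rules must run before any sentence splitting, or the stray periods in "e.g." and "U.S." get treated as sentence boundaries.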
The inter-sentence silence gap went from 300ms down to 150ms. A silence trimmer detects and shortens any excessive pauses the model inserts. On the final run, that trimmer removed 210 seconds of dead air, an 8.3% reduction.
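The trimmer's core job is simple: find silent runs and cap any run longer than the target gap. A minimal sketch of that logic (threshold, sample rate, and function name are illustrative assumptions, not the pipeline's actual values):

```python
import numpy as np

def trim_silences(samples: np.ndarray, sr: int = 24_000,
                  threshold: float = 0.01,
                  max_gap_s: float = 0.15) -> np.ndarray:
    """Collapse any silent run longer than `max_gap_s` down to exactly
    that length. 24 kHz is a common TTS output rate, assumed here."""
    silent = np.abs(samples) < threshold
    max_gap = int(max_gap_s * sr)          # 3600 samples at 24 kHz
    keep = np.ones(len(samples), dtype=bool)
    run_start = None
    for i, s in enumerate(np.append(silent, False)):  # sentinel closes last run
        if s and run_start is None:
            run_start = i
        elif not s and run_start is not None:
            if i - run_start > max_gap:
                keep[run_start + max_gap:i] = False   # drop the excess silence
            run_start = None
    return samples[keep]

# 0.2 s of signal, 1 s of silence, 0.2 s of signal at 24 kHz:
x = np.concatenate([np.full(4800, 0.5), np.zeros(24_000), np.full(4800, 0.5)])
print(len(trim_silences(x)))  # 13200: the 1 s gap is now 150 ms
```

Capping runs rather than deleting them entirely is the point: the output still breathes between sentences, it just stops idling.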
The full pipeline
The complete system chains six steps into a single command:
1. Preprocess the text.
2. Generate voice-cloned audio sentence by sentence using Chatterbox.
3. Trim the excess silence.
4. Extract word-level timestamps using Whisper.
5. Build a captioned video with real-time word highlighting at 1920x1080.
6. Apply a 0.85x speed adjustment to slow the delivery slightly.
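Structurally the system is a linear chain: each stage consumes the previous stage's artifact and writes a new one. A toy sketch of that orchestration, with placeholder step functions standing in for the real modules (only the ordering comes from the steps above):

```python
def make_step(name):
    def step(state: dict) -> dict:
        # A real step would read the previous artifact (text, wav, srt)
        # and write the next one; here we just record the order.
        return dict(state, log=state["log"] + [name])
    return step

STEPS = [make_step(n) for n in (
    "preprocess", "synthesize", "trim_silence",
    "align_whisper", "render_video", "adjust_speed")]

def run_pipeline(text_path: str, voice_path: str) -> dict:
    state = {"text": text_path, "voice": voice_path, "log": []}
    for step in STEPS:
        state = step(state)
    return state

print(run_pipeline("article.txt", "recording.mp3")["log"])
```

Keeping each stage file-in/file-out is what lets every step also run independently from the CLI: a failed render can restart from the trimmed audio instead of re-synthesizing 40+ minutes of speech.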
The pipeline processed 287 segments on an Apple Silicon GPU, producing about 42 minutes of narrated audio. Whisper extracted 6,912 words across 946 segments. The final video came out to 77 MB, 45 minutes and 17 seconds. Total pipeline runtime was around 130 minutes.
Then I open-sourced it
After getting the Dalio video right, the project became a proper Python package called Articulate. Ten modules, a unified CLI with seven subcommands, 16 unit tests, cross-platform support.
articulate run --voice recording.mp3 --text article.txt runs the full pipeline. articulate find-ref podcast.mp3 automatically finds the best reference clip from any audio file using sliding-window analysis. Each step can also run independently.
That reference finder exists so nobody has to scrub through interviews manually the way I did. It slides a 20-second window across the entire recording, scores each segment on silence percentage, energy consistency, and peak-to-RMS ratio, and extracts the best candidate.
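The three criteria combine into a single score per window. Here is a rough sketch of that scoring; the weights, thresholds, and hop size are my guesses, not Articulate's actual formula:

```python
import numpy as np

def score_window(w: np.ndarray) -> float:
    """Higher is better. Penalizes silence, uneven energy, and
    spiky (high crest factor) audio. Weights are assumptions."""
    frames = w[: len(w) - len(w) % 480].reshape(-1, 480)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-9
    silence = float((rms < 0.01).mean())          # less is better
    energy_cv = float(rms.std() / rms.mean())     # less is better
    crest = float(np.abs(w).max() / rms.mean())   # less is better
    return -(2.0 * silence + energy_cv + 0.1 * crest)

def best_reference(samples: np.ndarray, sr: int = 16_000,
                   window_s: float = 20.0, hop_s: float = 5.0):
    """Slide a 20 s window over the recording; return (start, end)
    in samples for the highest-scoring window."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    best = max(range(0, max(1, len(samples) - win + 1), hop),
               key=lambda i: score_window(samples[i:i + win]))
    return best, best + win

# Toy check: 20 s of silence, 20 s of steady tone, 20 s of choppy
# bursts -- the finder should pick the steady middle section.
sr = 16_000
t = np.arange(20 * sr)
tone = 0.5 * np.sin(2 * np.pi * 220 * t / sr)
gate = (t // (sr // 10)) % 2               # 100 ms on/off bursts
audio = np.concatenate([np.zeros(20 * sr), tone, tone * gate])
start, end = best_reference(audio, sr)
print(start // sr, end // sr)  # 20 40
```

The same frame-level RMS machinery drives all three features, so one pass over each window is enough; on a long podcast the hop size is the main cost knob.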
The repo is at github.com/laloquidity/articulate. Python 3.10+, GPU-accelerated on CUDA and Apple Silicon, MIT licensed.