In this tutorial, you'll learn how to access and navigate your Voice Library in just minutes. We'll cover how to customize each voice. Plus, you'll discover the Voice Cloning feature, allowing you to replicate your own voice for a truly personalized AI experience.
Creation Date: Mar 20, 2025
Created By: Alexandra Fojas
You can find the list of available voices under Configure. Scroll down to the bottom and you'll find "Voice Library."
When you click the Voice Library, it will show you the Featured Voices at the top and all the voices at the bottom.
You can search by the name of the language you're looking for, or use the filters if you need a specific language.
Different voices consume credits at different rates per minute. This variation exists because we use four different voice providers: Deepgram, ElevenLabs, OpenAI, and Cartesia. Each provider has its own pricing and resource requirements, which affect the credit usage for its voices. Check a voice's credit rate before selecting it so you can manage your credits efficiently.
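To make the per-minute credit rates concrete, here is a minimal Python sketch for estimating usage. The rates in it are hypothetical placeholders chosen for illustration, not actual pricing; replace them with the rates shown next to each voice in your Voice Library. The function names are also illustrative and are not part of any My AI Front Desk API.

```python
# Hypothetical credits-per-minute rates, for illustration only.
# Replace these with the rates displayed in your Voice Library.
HYPOTHETICAL_RATES = {
    "Deepgram": 1.0,
    "ElevenLabs": 2.0,
    "OpenAI": 1.5,
    "Cartesia": 1.25,
}


def estimate_credits(provider: str, minutes: float, rates=HYPOTHETICAL_RATES) -> float:
    """Return the credits a call of the given length would consume."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return rates[provider] * minutes


def minutes_affordable(provider: str, credits: float, rates=HYPOTHETICAL_RATES) -> float:
    """Return how many minutes of talk time a credit balance would cover."""
    return credits / rates[provider]


if __name__ == "__main__":
    for name in HYPOTHETICAL_RATES:
        print(name, estimate_credits(name, 30), "credits for 30 minutes")
```

With these placeholder rates, the same credit balance covers twice as many minutes on the cheapest voice as on the most expensive one, which is why the rate is worth checking before you select a voice.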
When you clone your voice, the credit usage will also vary. To create a custom voice, simply record your voice and upload the recording to the system. The system will then generate a voice model based on your recording. Keep in mind that the credit consumption for your cloned voice may differ depending on the processing and provider used.
Best practices on recording your voice:
Once you've chosen a voice by pressing Select, an options and settings button will appear.
There will be options to change the Voice Model, Speed, Emotion Name and Emotion Level.
Cartesia previously offered two speech models under the Sonic-1 category: Sonic-English, which specialized in English language processing, and Sonic-Multilingual, which supported multiple languages.
Recently, they introduced two new models: Sonic-2 and Sonic-Turbo. Unlike Sonic-1, each of these models handles English and non-English languages within a single system.
The pricing remains the same across all models, but the key difference lies in latency, which refers to the response time of the AI in processing and generating speech. According to Cartesia:
Lower latency means faster AI responses and reduced delay in Text-To-Speech (TTS) output. Based on my experience, Sonic-2 provides a good balance between speed and naturalness, making it my recommended choice.
We recommend using the Normal pace for our AI to make it sound more conversational.
You can customize your receptionist based on the main emotion it's going to convey during the call.
To further customize the AI, we have the Emotion Level.
Once you're all set, press OK.
Similar to the process above, click on the settings and it will show you the two options provided below.
ElevenLabs offers Flash and Turbo models, each optimized for different needs. Flash models prioritize ultra-low latency for real-time applications; Flash v2 is English-only with under 75 ms latency. They are ideal for conversational AI but have slightly lower quality. Turbo models focus on lifelike speech and emotional depth, making them better for voiceovers and content creation, though they have higher latency. Turbo v2 (English-only) delivers superior quality at the cost of speed.
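The trade-off between the two models can be summarized as a small lookup, sketched below in Python. The model traits come from the description above; the class, field, and function names are hypothetical and only illustrate the decision, since the actual choice is made in the voice settings dialog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceModel:
    name: str
    english_only: bool
    low_latency: bool  # True when the model is tuned for real-time responses
    note: str


# Traits as described in this article; names here are illustrative only.
MODELS = [
    VoiceModel("Flash v2", True, True, "under 75 ms latency, slightly lower quality"),
    VoiceModel("Turbo v2", True, False, "more lifelike speech, higher latency"),
]


def pick_model(real_time: bool) -> VoiceModel:
    """Pick the low-latency model for live calls, the higher-quality one otherwise."""
    for model in MODELS:
        if model.low_latency == real_time:
            return model
    raise LookupError("no model matches the requested trait")


if __name__ == "__main__":
    choice = pick_model(real_time=True)
    print(f"{choice.name}: {choice.note}")
```

For a phone receptionist, where callers notice delays, the sketch selects the low-latency model; for recorded content it selects the higher-quality one.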
Start your free trial of My AI Front Desk today. It takes minutes to set up!