Text-to-Speech Studio
Clone voices and synthesize natural speech with AI

Text-to-Speech Service Guide
Welcome to the Problembo Text-to-Speech studio. Follow this guide to produce natural AI speech that matches your script and target voice.
1. Overview
Our TTS engine combines textual prompts with ordered voice references to deliver high-fidelity speech. The first reference sets the primary tone; additional references let you cover different emotions, pacing, or phonetics.
Key capabilities:
- Clone a speaker from short samples (5–60 seconds)
- Control rate, energy, and emotion via prompt directives
- Generate aligned metadata for further editing or lip-sync
- Synthesize multilingual speech with phoneme-level control
2. Writing Effective Prompts
The prompt controls what the speaker says and how they say it. Keep it concise but descriptive.
Prompt checklist:
- Provide the full script or a bullet outline
- Specify pacing cues ([pause 1.2s], faster, softly)
- Clarify emotional state (warm, confident, urgent)
- Mention pronunciation hints for names or acronyms ("H. Q." pronounce letters)
- Add language/locale tags if different from the references
Example:
Deliver a 25s launch announcement in English with upbeat energy.
Keep sentences short, smile on key phrases, pause 1s before the price reveal.
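The prompt checklist can also be applied programmatically when you generate many prompts. The following is a minimal sketch, not an official API: `PromptSpec` and `build_prompt` are hypothetical names, and the "Language:", "Tone:", and "Pacing:" labels are illustrative assumptions rather than directives the engine is documented to require.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptSpec:
    """Hypothetical container for the checklist items (not a service API)."""
    script: str                                          # full script or outline
    pacing: list = field(default_factory=list)           # e.g. ["faster", "softly"]
    emotion: list = field(default_factory=list)          # e.g. ["warm", "confident"]
    pronunciation: dict = field(default_factory=dict)    # term -> hint
    locale: Optional[str] = None                         # e.g. "en-US"


def build_prompt(spec: PromptSpec) -> str:
    """Join the checklist items into one prompt, directives first, script last."""
    lines = []
    if spec.locale:
        lines.append(f"Language: {spec.locale}.")
    if spec.emotion:
        lines.append("Tone: " + ", ".join(spec.emotion) + ".")
    if spec.pacing:
        lines.append("Pacing: " + ", ".join(spec.pacing) + ".")
    for term, hint in spec.pronunciation.items():
        lines.append(f'Pronounce "{term}": {hint}.')
    lines.append(spec.script.strip())
    return "\n".join(lines)


if __name__ == "__main__":
    spec = PromptSpec(
        script="Welcome to the launch. [pause 1s] The price is ten dollars.",
        pacing=["faster"],
        emotion=["warm", "confident"],
        pronunciation={"H. Q.": "pronounce letters"},
        locale="en-US",
    )
    print(build_prompt(spec))
```

Inline cues such as [pause 1s] stay inside the script itself, so they apply at the exact position where they appear.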
3. Preparing Voice References
- Prefer clean studio or podcast quality audio
- Remove background music and compression artifacts
- Keep files between 5 and 60 seconds
- Upload in WAV, MP3, OGG, FLAC, AAC, M4A, or OPUS (≤50 MB)
- Order matters: slot #1 sets the primary style; later slots add variations
- Use different samples to cover emotional range or tricky phonemes
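The format, size, and length limits above can be checked locally before uploading. This is a rough sketch using only the Python standard library; `check_reference` is a hypothetical helper, the duration check covers WAV files only (other formats would need an audio library), and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from __future__ import annotations

import wave
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".aac", ".m4a", ".opus"}
MAX_BYTES = 50 * 1024 * 1024          # assumption: "50 MB" read as 50 MiB
MIN_SECONDS, MAX_SECONDS = 5.0, 60.0  # reference length limits from the guide


def wav_duration_seconds(path: Path) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(path: Path) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    ext = path.suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if path.stat().st_size > MAX_BYTES:
        problems.append("file larger than 50 MB")
    if ext == ".wav":
        # Duration is only measured for WAV; other formats are not decoded here.
        duration = wav_duration_seconds(path)
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(f"duration {duration:.1f}s outside 5-60s")
    return problems
```

Running `check_reference(Path("voice.wav"))` on each file in slot order gives a quick pre-upload report; it does not assess audio quality, background music, or compression artifacts.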
4. Workflow Tips
- Upload at least one reference before submitting the prompt
- To change the reference order, remove slots and re-add them in the order you want
- Reuse the same references for multiple prompts to stay consistent
- Monitor task progress in the panel; results include audio and timing JSON
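The timing JSON returned with each result can be used to verify that requested pauses actually landed. The guide does not document the JSON schema, so the sketch below assumes a hypothetical layout: a top-level "words" list whose entries carry "text", "start", and "end" in seconds. Adjust the field names to match the real output.

```python
import json

# Hypothetical sample; the real timing JSON schema may differ.
SAMPLE = """
{"words": [
  {"text": "Welcome", "start": 0.00, "end": 0.42},
  {"text": "back",    "start": 0.45, "end": 0.80},
  {"text": "today",   "start": 2.00, "end": 2.50}
]}
"""


def find_pauses(timing, min_gap=0.5):
    """List (previous word, next word, gap in seconds) for gaps >= min_gap."""
    words = timing["words"]
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["text"], nxt["text"], round(gap, 3)))
    return pauses


if __name__ == "__main__":
    print(find_pauses(json.loads(SAMPLE)))
```

Comparing the reported gaps against cues such as [pause 1.2s] in the prompt shows whether a pacing directive needs to be reworded.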
5. Troubleshooting
- Robotic results: add more nuanced instructions (breathing, pauses)
- Incorrect language: state the target language explicitly and add phonetic hints
- Noisy output: provide cleaner references or remove background noise from the existing ones
- Mispronunciations: add phonetic hints or re-record the reference segment
Happy synthesizing!