Audio Transcription: Free Whisper AI

Transcribe audio files to text with Whisper AI. Runs 100% in the browser, with no upload.

Supported audio formats: MP3, WAV, OGG, FLAC, M4A. Maximum file size: 100MB.

🔒 100% Private: your files never leave your device. All processing runs locally in your browser.

How It Works

Powered by OpenAI's Whisper model running locally via WebAssembly, this tool transcribes spoken audio into text with near-human accuracy on clear recordings. The AI model loads once (~40MB for the base model) and processes your audio entirely on your device. No data is sent to any server, which makes it one of the few online transcription tools that keeps your audio fully private.

Select an audio file (MP3, WAV, OGG, FLAC, M4A) or video file (MP4, WebM), choose the spoken language or let the AI auto-detect it, and receive a timestamped transcript. Short clips finish within seconds; longer recordings take proportionally longer. The model recognizes 99+ languages and handles accents, dialects, and mixed-language audio well, though accuracy varies by language and recording quality.
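The input rules above can be sketched as a small pre-flight check. This is only an illustration of the stated limits, not the tool's actual code: the function name `check_input` is hypothetical, and it assumes that the 100MB limit stated near the top of this page means 100 * 1024 * 1024 bytes and applies to video files as well.

```python
from pathlib import Path

# Formats listed on this page.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a"}
VIDEO_EXTENSIONS = {".mp4", ".webm"}
# Assumption: "100MB" is read as 100 * 1024 * 1024 bytes (the tool may use 10**6).
MAX_BYTES = 100 * 1024 * 1024


def check_input(filename: str, size_bytes: int) -> str:
    """Return 'audio' or 'video' for an acceptable file, else raise ValueError."""
    ext = Path(filename).suffix.lower()
    if size_bytes > MAX_BYTES:
        raise ValueError(f"{filename}: {size_bytes} bytes exceeds the 100MB limit")
    if ext in AUDIO_EXTENSIONS:
        return "audio"
    if ext in VIDEO_EXTENSIONS:
        return "video"  # the audio track would be extracted before transcription
    raise ValueError(f"{filename}: unsupported extension {ext!r}")
```

Checking the extension is a cheap first filter; a real implementation would likely also inspect the file contents, since an extension can be wrong.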

Export your transcript in three formats: plain text (for documents), SRT (for video subtitles), or VTT (for web video captioning). The SRT and VTT formats include precise timestamps that sync with the original audio, so you can import them directly into video editors or YouTube for automatic captioning.

Unlike cloud transcription services (Google Speech-to-Text, AWS Transcribe, Rev.com), running Whisper locally means no per-minute billing, no audio upload, and no third party handling your recordings. The AI model runs in your browser's WebAssembly sandbox, and the audio data never touches the network.

Features

✓ OpenAI Whisper AI: state-of-the-art speech recognition with near-human accuracy
✓ Supports 99+ languages with automatic language detection
✓ Timestamped output with word-level precision for subtitle generation
✓ Export as plain text, SRT subtitles, or VTT web captions
✓ Runs entirely locally, so your audio stays private on your device
✓ Accepts audio files (MP3, WAV, OGG, FLAC, M4A) and video files (MP4, WebM)
✓ Model loads once (~40MB) and is cached by the browser for instant subsequent use
✓ Zero per-minute billing, zero accounts, unlimited free transcriptions

How to Use This Tool

1. Select an audio or video file (MP3, WAV, OGG, FLAC, M4A, MP4, WebM). The tool automatically extracts audio from video files.
2. Select the spoken language from the dropdown, or leave it on auto-detect. Choosing the language manually skips the detection step and tends to give slightly faster, more accurate results.
3. Click "Transcribe" and wait for the AI to process your audio. Processing time depends on audio length; on modern devices the tool typically runs about 10-15x faster than real time.
4. Review the timestamped transcript, then copy as plain text or export as SRT/VTT subtitle files. The files are ready for immediate use in video editors, YouTube, or web players.
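The speed figure quoted for the transcription step can be turned into a rough wait-time estimate. The sketch below assumes that "10-15x real-time" means the audio is processed 10 to 15 times faster than it plays; actual speed depends on the device, and the function name is illustrative.

```python
def estimate_wait_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Estimated processing time for a recording at a given real-time speed factor."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


audio = 30 * 60  # a 30-minute recording, in seconds
slowest = estimate_wait_seconds(audio, 10)  # 180.0 seconds
fastest = estimate_wait_seconds(audio, 15)  # 120.0 seconds
print(f"Expect roughly {fastest / 60:.0f} to {slowest / 60:.0f} minutes")
```

Under that assumption, a 30-minute recording would take about two to three minutes; older or low-power devices may be noticeably slower.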

⚡ Powered by OpenAI Whisper, trained on 680,000 hours of multilingual audio and covering 99+ languages with near-human accuracy on clear recordings. OpenAI also uses Whisper to transcribe speech in ChatGPT's voice features.

Perfect For

→ YouTubers and content creators generating SRT subtitle files for video accessibility compliance and improved SEO discoverability
→ Journalists and reporters transcribing recorded interviews, press conferences, and phone calls for article research and fact-checking
→ Students converting recorded lectures, seminars, and study group discussions into searchable text notes for exam preparation
→ Podcasters creating show notes, episode transcripts, and blog posts from audio recordings for website content and accessibility
→ Researchers transcribing qualitative interview data, focus group recordings, and field notes for academic analysis
→ Legal professionals transcribing depositions, client meetings, and courtroom proceedings for case documentation
→ Corporate teams transcribing meeting recordings, webinars, and training sessions for minutes and knowledge management
→ Accessibility teams creating captions for corporate video content to meet ADA and WCAG compliance requirements

Under the Hood

The tool runs OpenAI's Whisper model compiled to WebAssembly via whisper.cpp, a high-performance C++ port of the original Python model. The model architecture is a Transformer encoder-decoder: the encoder processes log-mel spectrogram features extracted from the audio (80 mel filter banks at a 16kHz sample rate), and the decoder autoregressively generates text tokens interleaved with timestamp tokens.
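The feature extraction described above implies a fixed input size for the encoder, which the following sketch works out. The sample rate and mel count come from this page; the window, stride, and 30-second chunk length are taken from the reference Whisper implementation rather than from this page, and the fixed-step chunk count is a simplification, since the decoder's timestamps decide where the next window actually starts.

```python
import math

SAMPLE_RATE = 16_000  # Hz, as stated above
N_MELS = 80           # mel filter banks, as stated above
# The values below come from the reference Whisper implementation, not this page.
N_FFT = 400           # analysis window: 400 samples = 25 ms
HOP_LENGTH = 160      # stride between frames: 160 samples = 10 ms
CHUNK_SECONDS = 30    # audio is padded or split into 30-second windows

FRAMES_PER_CHUNK = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH  # 3000 frames


def encoder_input_shapes(audio_seconds: float):
    """Return one (n_mels, n_frames) shape per 30-second window (simplified)."""
    n_chunks = max(1, math.ceil(audio_seconds / CHUNK_SECONDS))
    return [(N_MELS, FRAMES_PER_CHUNK)] * n_chunks


print(encoder_input_shapes(75))  # a 75-second clip needs three windows
```

Each window is therefore an 80 x 3000 array no matter how much speech it contains, which is why very short clips are padded up to 30 seconds before encoding.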

Whisper was trained on 680,000 hours of multilingual audio data collected from the internet, making it one of the most broadly trained speech recognition models in existence. The training data covers diverse accents, recording conditions, languages, and domains — which is why it generalizes well to real-world audio that often challenges specialized ASR systems.

Unlike cloud transcription APIs (Google Speech-to-Text at $0.006/15sec, AWS Transcribe at $0.024/min), all inference runs locally in the browser's WebAssembly sandbox. This means zero API costs regardless of volume, complete data privacy (audio never leaves the device), and no rate limiting. The tradeoff is processing speed — local inference on consumer hardware is slower than dedicated cloud GPU clusters.
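As a worked example of the billing comparison, the sketch below prices a batch of audio at the two rates quoted above. The rates are copied from this page and may not match current provider pricing; free tiers, volume discounts, and per-request rounding are ignored.

```python
from decimal import Decimal

# Rates as quoted on this page; actual provider pricing may differ or change.
GOOGLE_PER_15_SEC = Decimal("0.006")
AWS_PER_MINUTE = Decimal("0.024")


def google_cost(minutes: int) -> Decimal:
    """Cost at the quoted Google rate: four 15-second blocks per minute."""
    return GOOGLE_PER_15_SEC * 4 * minutes


def aws_cost(minutes: int) -> Decimal:
    """Cost at the quoted AWS rate."""
    return AWS_PER_MINUTE * minutes


minutes = 10 * 60  # ten hours of recordings
print(f"Google: ${google_cost(minutes):.2f}  AWS: ${aws_cost(minutes):.2f}  Local: $0.00")
```

Both quoted rates work out to $0.024 per minute, so ten hours of audio comes to $14.40 with either service, while local inference costs only the time it takes to run.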

Frequently Asked Questions

How accurate is the transcription?

The Whisper model achieves near-human accuracy on clear audio with minimal background noise. OpenAI reports word error rates (WER) of 5-10% for English, which is comparable to professional human transcription services. Accuracy varies by language, audio quality, and speaker clarity: noisy recordings, heavy accents, and multiple simultaneous speakers all reduce it.
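For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is a plain textbook implementation using word-level edit distance; it is not the evaluation code behind the figures above, and it normalizes only by lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four gives a WER of 25%.
print(word_error_rate("the quick brown fox", "the quick brown dog"))
```

A WER of 5-10% therefore means roughly one word in every ten to twenty needs correcting, which is worth keeping in mind when deciding how much proofreading a transcript needs.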
