ESP32 Voice Translator ES→EN

DIY Project · ESP32 + I2S + Whisper + gTTS

Voice Translator
Spanish → English

Hold the button and speak in Spanish, release it, and hear the English translation through the speaker. The build uses a 300–3400 Hz band-pass filter tuned for the human voice, an INMP441 I2S microphone, a MAX98357A amplifier, and a Python server running Whisper, LibreTranslate, and gTTS.

INMP441 I2S Mic · MAX98357A Amp · 300–3400 Hz Filter · Whisper STT · LibreTranslate · gTTS · WiFi HTTP

Complete system flow

1. 🎙 Record: INMP441 over I2S; button = record
2. 🎛 Filter: Butterworth, 300–3400 Hz
3. 📡 Send: WAV over WiFi, HTTP POST
4. 🔤 STT + translate: Whisper + gTTS, LibreTranslate
5. 🔊 Play back: MAX98357A over I2S, English audio

Components

ESP32 DevKit v1 (CPU)

Dual-core 240 MHz, WiFi, two native I2S peripherals, optional PSRAM for the audio buffer.

INMP441 (I2S MIC)

Digital MEMS microphone with I2S output. 24-bit, 61 dB SNR, 60 Hz–15 kHz range. Well suited to voice.

MAX98357A (I2S AMP)

Class-D amplifier with I2S input. 3.2 W into 4 Ω, no coupling capacitor needed, filters digital noise.

4 Ω / 3 W speaker (SPEAKER)

Compact loudspeaker. Its 100 Hz–8 kHz range is enough for clear synthesized speech.

Button + RGB LEDs (UI)

Momentary push button for recording. LEDs: red = recording, green = processing, blue = playing.

PC / Raspberry Pi (SERVER)

Runs the Python server with Whisper + LibreTranslate + gTTS. Must be on the same WiFi network as the ESP32.

Hardware connections

INMP441 → ESP32 (I2S RX)

VCC───3.3V

GND───GND

SCK (BCLK)──GPIO 26

WS (LRCLK)──GPIO 25

SD (Data)───GPIO 32

L/R───GND (mono, left channel)

MAX98357A → ESP32 (I2S TX)

VIN───5V

GND───GND

BCLK───GPIO 27

LRC───GPIO 14

DIN───GPIO 13

SD (mute)──3.3V (always on)

MAX98357A → Speaker

OUT+───Speaker + (4Ω)

OUT−───Speaker −

Connect the speaker directly, with no coupling capacitor.
The MAX98357A output is already differential.

UI: Button and LEDs

Button───GPIO 0 + PULLUP

Red LED──GPIO 2 + 220Ω

Green LED──GPIO 4 + 220Ω

Blue LED───GPIO 5 + 220Ω

Device states

Idle: system ready; press the button to speak. A blinking green LED means the device is waiting.

ℹ Code for Arduino IDE or PlatformIO; the listings below show the main parts of the sketch. Required library: ArduinoJson. I2S support is built into the ESP32 core (no external library needed).

esp32_translator.ino: Configuration

// ══ CONFIGURATION: edit these values ══
#include <WiFi.h>
#include <HTTPClient.h>
#include <driver/i2s.h>
#include <ArduinoJson.h>

// ── WiFi ─────────────────────────────────
const char* WIFI_SSID  = "YOUR_WIFI_SSID";
const char* WIFI_PASS  = "YOUR_WIFI_PASSWORD";
const char* SERVER_URL = "http://192.168.1.100:5000/translate";

// ── I2S pins, INMP441 microphone ─────────
#define MIC_SCK  26   // BCLK
#define MIC_WS   25   // LRCLK
#define MIC_SD   32   // Data IN

// ── I2S pins, MAX98357A amplifier ────────
#define AMP_BCLK 27   // BCLK
#define AMP_LRC  14   // LRCLK
#define AMP_DIN  13   // Data OUT → amp DIN

// ── UI ───────────────────────────────────
#define BTN_PIN   0    // Record button (PULLUP)
#define LED_RED   2    // Recording
#define LED_GREEN 4    // Processing
#define LED_BLUE  5    // Playing

// ── Audio ────────────────────────────────
#define SAMPLE_RATE     16000   // Hz
#define MAX_RECORD_SEC  5
#define BUFFER_SAMPLES  (SAMPLE_RATE * MAX_RECORD_SEC)   // 80,000 samples
#define BUFFER_BYTES    (BUFFER_SAMPLES * 2)             // 160,000 bytes of 16-bit PCM
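
As a quick check of these constants, the arithmetic below works out how much memory the recording buffer needs and how large the uploaded WAV becomes. It is a sketch in Python rather than firmware code; the variable names are illustrative, and the 44-byte header size assumes a plain PCM WAV.

```python
# Sizes implied by the audio settings above (names are illustrative).
SAMPLE_RATE = 16000        # samples per second
MAX_RECORD_SEC = 5         # longest recording, in seconds
BYTES_PER_SAMPLE = 2       # 16-bit PCM
WAV_HEADER_BYTES = 44      # assumption: canonical PCM WAV header

buffer_samples = SAMPLE_RATE * MAX_RECORD_SEC
buffer_bytes = buffer_samples * BYTES_PER_SAMPLE
upload_bytes = buffer_bytes + WAV_HEADER_BYTES

print(f"samples: {buffer_samples}")
print(f"buffer:  {buffer_bytes} bytes ({buffer_bytes / 1024:.1f} KiB)")
print(f"upload:  {upload_bytes} bytes")
```

The buffer takes a large share of the ESP32's internal RAM, which is likely why the component list mentions optional PSRAM. Lowering MAX_RECORD_SEC shrinks it proportionally.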

Butterworth filter + recording

// ══ VOICE FILTER: Butterworth band-pass 300–3400 Hz ══
struct BiquadFilter {
  float b0, b1, b2, a1, a2;
  float x1, x2, y1, y2;
};
BiquadFilter hpf, lpf; // High-pass 300Hz + Low-pass 3400Hz

void initVoiceFilter() {
  // ── HPF 300Hz @ 16kHz (Q=0.707 Butterworth) ──
  float w0 = 2*M_PI*300.0f/16000.0f;
  float alpha = sinf(w0)/(2*0.7071f);
  float c = cosf(w0), a0 = 1+alpha;
  hpf.b0 = (1+c)/(2*a0);  hpf.b1 = -(1+c)/a0;
  hpf.b2 = hpf.b0;
  hpf.a1 = -2*c/a0;   hpf.a2 = (1-alpha)/a0;

  // ── LPF 3400Hz @ 16kHz ──
  w0 = 2*M_PI*3400.0f/16000.0f;
  alpha = sinf(w0)/(2*0.7071f);
  c = cosf(w0); a0 = 1+alpha;
  lpf.b0 = (1-c)/(2*a0);  lpf.b1 = (1-c)/a0;
  lpf.b2 = lpf.b0;
  lpf.a1 = -2*c/a0;   lpf.a2 = (1-alpha)/a0;
}

inline float biquad(BiquadFilter& f, float x) {
  float y = f.b0*x + f.b1*f.x1 + f.b2*f.x2
              - f.a1*f.y1 - f.a2*f.y2;
  f.x2=f.x1; f.x1=x; f.y2=f.y1; f.y1=y;
  return y;
}

inline int16_t applyVoiceFilter(int16_t s) {
  float y = biquad(lpf, biquad(hpf, (float)s));
  return (y>32767)?32767:(y<-32768)?-32768:(int16_t)y;
}

// ══ RECORDING with the filter applied ══
// audioBuf and samplesRec are globals declared elsewhere in the sketch.
void recordChunk() {
  int32_t raw[256]; size_t br=0;
  i2s_read(I2S_NUM_0, raw, sizeof(raw), &br, 10);
  int n = br/sizeof(int32_t);
  for(int i=0; i<n && samplesRec<BUFFER_SAMPLES; i++) {
    // INMP441: the 24-bit sample sits in the MSBs of the int32 → shift down to int16
    int16_t s = (int16_t)(raw[i] >> 16);
    audioBuf[samplesRec++] = applyVoiceFilter(s);
  }
}
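
The `>> 16` step in recordChunk() can be tried off-device. The sketch below, in Python for convenience, assumes each I2S slot arrives as a little-endian signed 32-bit word with the sample in its upper bits, as the firmware comment describes; `inmp441_to_int16` is a hypothetical helper name.

```python
import struct

def inmp441_to_int16(raw_bytes):
    """Keep the top 16 bits of each little-endian signed 32-bit I2S slot."""
    count = len(raw_bytes) // 4
    slots = struct.unpack(f"<{count}i", raw_bytes[:count * 4])
    # Python's >> on negative ints is an arithmetic shift, so the sign is preserved.
    return [slot >> 16 for slot in slots]

# Four example slots: full-scale positive, +1 LSB, -1 LSB, full-scale negative.
raw = struct.pack("<4i", 0x7FFFFF00, 0x00010000, -0x00010000, -0x80000000)
samples = inmp441_to_int16(raw)
print(samples)
```

Dropping the low 16 bits discards the bottom 8 bits of the microphone's 24-bit resolution, which is usually acceptable for speech.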

I2S playback + main loop

// ══ PLAY the TTS audio over I2S (MAX98357A) ══
void playAudioStream(const char* audioUrl) {
  HTTPClient http;
  http.begin(audioUrl);
  if(http.GET() != 200) { http.end(); return; }

  WiFiClient* stream = http.getStreamPtr();
  int sk=0;                              // Skip the 44-byte WAV header
  while(sk<44 && stream->connected()) {
    if(stream->available()) { stream->read(); sk++; }
  }

  int16_t buf[256*2]; // mono → stereo
  int remaining = http.getSize()-44;     // getSize() is -1 without Content-Length

  while(remaining>1 && stream->connected()) {
    int16_t mono[256]; int n=0;
    // Read whole samples only, low byte first (WAV is little-endian).
    // Two read() calls in one expression have no guaranteed order in C++.
    while(n<256 && remaining>1 && stream->available()>=2) {
      int lo = stream->read();
      int hi = stream->read();
      mono[n++] = (int16_t)(lo | (hi<<8));
      remaining-=2;
    }
    // Duplicate mono → stereo for the MAX98357A
    for(int i=0;i<n;i++){buf[i*2]=mono[i];buf[i*2+1]=mono[i];}
    size_t w; if(n>0) i2s_write(I2S_NUM_1, buf, n*4, &w, portMAX_DELAY);
  }
  i2s_zero_dma_buffer(I2S_NUM_1);
  http.end();
}
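
The playback routine relies on two things: the WAV being 16-bit mono with a 44-byte header, and each sample being duplicated into a left/right pair. The Python sketch below checks both on a small in-memory file; `describe_wav` and `mono_to_stereo` are hypothetical helper names, not part of the firmware or the server.

```python
import io
import struct
import wave

def describe_wav(wav_bytes):
    """Return (channels, sample_width_bytes, sample_rate, pcm_bytes) for a WAV in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return (w.getnchannels(), w.getsampwidth(), w.getframerate(),
                w.readframes(w.getnframes()))

def mono_to_stereo(pcm):
    """Duplicate each little-endian int16 sample into a left/right pair."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    stereo = []
    for s in samples:
        stereo += [s, s]
    return struct.pack(f"<{len(stereo)}h", *stereo)

# Build a tiny 16 kHz mono WAV in memory to exercise both helpers.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<3h", 100, -200, 300))
demo = buf.getvalue()

channels, width, rate, pcm = describe_wav(demo)
stereo = mono_to_stereo(pcm)
print(channels, width, rate, len(demo) - len(pcm))
```

Running `describe_wav` on a file fetched from the server's /audio route is one way to confirm the format before debugging playback on the device.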

// ══ MAIN LOOP ══
// btnWas and isRecording are globals; startRecording(), stopRecording(), playBeep(),
// blinkLED() and sendAndTranslate() are defined elsewhere in the sketch.
void loop() {
  bool btn = (digitalRead(BTN_PIN) == LOW);

  if(btn && !btnWas) {       // Button pressed
    playBeep(660, 80);
    startRecording();
  }
  if(!btn && btnWas) {       // Button released
    stopRecording();
    playBeep(440, 80);
    digitalWrite(LED_GREEN, HIGH);
    if(sendAndTranslate())    // Send to the server
      blinkLED(LED_GREEN, 3);
    else
      blinkLED(LED_RED, 3);
    digitalWrite(LED_GREEN, LOW);
  }
  btnWas = btn;
  if(isRecording) recordChunk();
  delay(1);
}
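
The loop reacts to changes in the button level (edges), not to the level itself, which is why it keeps the previous reading in btnWas. A minimal Python sketch of that idea follows; `button_events` is a hypothetical name, and the readings are assumed to be already converted so that True means pressed.

```python
def button_events(levels):
    """Yield 'press' and 'release' events from successive button readings."""
    was_pressed = False
    for pressed in levels:
        if pressed and not was_pressed:
            yield "press"       # rising edge: the firmware starts recording here
        elif was_pressed and not pressed:
            yield "release"     # falling edge: the firmware stops and uploads here
        was_pressed = pressed

readings = [False, True, True, True, False, False, True, False]
events = list(button_events(readings))
print(events)
```

A real button can bounce and produce several edges per press. The sketch, like the firmware loop shown, does not debounce.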

✓ Python server built on Flask. It receives the WAV from the ESP32, transcribes it with Whisper, translates it with LibreTranslate (local and free), and generates audio with gTTS.

server.py: STT + Translation + TTS

# ═══════════════════════════════════════════════
# ESP32 translation server
# Flask + Whisper + LibreTranslate + gTTS
# ═══════════════════════════════════════════════
import os, io, uuid, time, logging
from pathlib import Path
from flask import Flask, request, jsonify, send_file
import whisper
import requests
from gtts import gTTS
import soundfile as sf
import numpy as np

AUDIO_DIR = Path("audio_tmp"); AUDIO_DIR.mkdir(exist_ok=True)
LIBRETRANSLATE_URL = "http://localhost:5100/translate"
WHISPER_MODEL = "small"  # tiny/base/small

stt_model = whisper.load_model(WHISPER_MODEL)
app = Flask(__name__)

def apply_voice_bandpass(audio, sr):
    from scipy.signal import butter, sosfilt
    nyq = sr/2
    sos = butter(4, [300/nyq, 3400/nyq], btype='band', output='sos')
    return sosfilt(sos, audio).astype(np.float32)

def translate(text):
    r = requests.post(LIBRETRANSLATE_URL,
        json={"q": text, "source": "es", "target": "en"}, timeout=10)
    r.raise_for_status()
    return r.json()["translatedText"]

def text_to_speech(text, out_path, sr_target=16000):
    mp3 = out_path.with_suffix(".mp3")
    gTTS(text=text, lang="en").save(str(mp3))
    from pydub import AudioSegment
    seg = AudioSegment.from_mp3(str(mp3)).set_channels(1)  \
            .set_frame_rate(sr_target).set_sample_width(2)
    seg.export(str(out_path), format="wav")
    mp3.unlink(missing_ok=True)

@app.route("/translate", methods=["POST"])
def translate_endpoint():
    req_id = uuid.uuid4().hex[:8]
    wav_in = AUDIO_DIR / f"{req_id}_in.wav"
    wav_in.write_bytes(request.data)

    audio, sr = sf.read(str(wav_in), dtype='float32')
    filtered = apply_voice_bandpass(audio, sr)
    wav_f = AUDIO_DIR / f"{req_id}_f.wav"
    sf.write(str(wav_f), filtered, sr)

    result = stt_model.transcribe(str(wav_f), language="es",
               task="transcribe", fp16=False, temperature=0)
    original = result["text"].strip()
    wav_in.unlink(missing_ok=True); wav_f.unlink(missing_ok=True)

    translated = translate(original)
    wav_out = AUDIO_DIR / f"{req_id}_tts.wav"
    text_to_speech(translated, wav_out)

    return jsonify({
        "original_text":   original,
        "translated_text": translated,
        "audio_url":       f"/audio/{req_id}_tts.wav"
    })

@app.route("/audio/<f>")
def serve_audio(f):
    p = AUDIO_DIR / f
    resp = send_file(str(p), mimetype="audio/wav")
    @resp.call_on_close
    def rm(): p.unlink(missing_ok=True)
    return resp

# Health check used by the "curl .../status" test in the setup steps.
# Assumption: this route is missing from the listing; the reply shape is illustrative.
@app.route("/status")
def status():
    return jsonify({"status": "ok", "whisper_model": WHISPER_MODEL})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, threaded=True)
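
The /translate endpoint can be exercised from a PC before involving the ESP32. The sketch below uses only the Python standard library: it builds a mono 16-bit WAV in memory and defines a function that posts it as the raw request body, which is what the server reads via `request.data`. `make_wav` and `post_recording` are hypothetical names, and the URL in the usage note is the example address from the firmware configuration.

```python
import io
import json
import math
import struct
import urllib.request
import wave

def make_wav(samples, sample_rate=16000):
    """Pack int16 samples into a mono 16-bit PCM WAV and return it as bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()

def post_recording(url, wav_bytes, timeout=60):
    """POST the WAV as the raw body and return the server's JSON reply as a dict."""
    req = urllib.request.Request(
        url, data=wav_bytes,
        headers={"Content-Type": "audio/wav"}, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

# One second of a 440 Hz tone as stand-in audio (not speech).
tone = [int(12000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000)]
wav = make_wav(tone)
print(len(wav), "bytes ready to send")
```

With the server running, `post_recording("http://192.168.1.100:5000/translate", wav)` should return a dict with the `original_text`, `translated_text`, and `audio_url` keys built in translate_endpoint(). A tone will not transcribe to anything meaningful, so substitute a real Spanish recording for an end-to-end test.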

Install Python and dependencies

Python 3.10+. Run these commands on the server (PC or Raspberry Pi 4).

Terminal

# Install the Python dependencies
pip install flask openai-whisper requests gTTS \
            soundfile numpy pydub scipy

# Install ffmpeg (used by pydub for MP3→WAV and by Whisper to load audio)
# Linux:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg

# Download the Whisper model ("small": 244M parameters, roughly 460 MB on disk)
python3 -c "import whisper; whisper.load_model('small')"

Start LibreTranslate with Docker (free, no API key)

docker-compose.yml

# docker-compose.yml
services:
  libretranslate:
    image: libretranslate/libretranslate:latest
    ports: ["5100:5000"]
    environment:
      - LT_LOAD_ONLY=es,en

# Start:
docker compose up -d
# Verify:
curl http://localhost:5100/languages

Start the translation server

Terminal

python3 server.py
# Expected output:
# Loading Whisper model 'small'...
# Whisper ready.
# * Running on http://0.0.0.0:5000

# Check that it works:
curl http://localhost:5000/status

Configure the ESP32 with your IP and SSID

In the .ino file, set WIFI_SSID, WIFI_PASS, and SERVER_URL to match your network and your server's IP address (find it with ipconfig or ip addr).

Flash and test

Upload the code to the ESP32. Open the Serial Monitor at 115200 baud. Press and hold the button → speak in Spanish → release the button → listen to the English translation.

ℹ The filter is applied twice: on the ESP32 (in real time during recording) and again on the server before the audio is passed to Whisper. The server-side pass is optional.

Butterworth coefficients: how they are computed

/*  2nd-order IIR biquad filter
    Difference equation:
    y[n] = b0·x[n] + b1·x[n-1] + b2·x[n-2]
             - a1·y[n-1] - a2·y[n-2]

    Coefficient calculation (bilinear transform):
    ω₀ = 2π·fc/fs      (normalized cutoff frequency)
    α  = sin(ω₀)/(2·Q)  Q=0.707 = Butterworth, maximally flat
    a₀ = 1+α            (normalizes every coefficient)

    High-pass (300 Hz):
    b0 =  (1+cos(ω₀))/(2·a₀)
    b1 = -(1+cos(ω₀))/a₀
    b2 =  (1+cos(ω₀))/(2·a₀)
    a1 = -2·cos(ω₀)/a₀
    a2 =  (1-α)/a₀

    Low-pass (3400 Hz):
    b0 =  (1-cos(ω₀))/(2·a₀)
    b1 =  (1-cos(ω₀))/a₀
    b2 =  (1-cos(ω₀))/(2·a₀)
    a1 = -2·cos(ω₀)/a₀
    a2 =  (1-α)/a₀

    Cascade: input → HPF → LPF → output
    Result: 300 Hz–3400 Hz band-pass
*/

// Precomputed values for fs=16000 Hz (rounded to 4 decimals):
// HPF 300Hz: b0=0.9201, b1=-1.8401, b2=0.9201
//            a1=-1.8337, a2=0.8465
// LPF 3400Hz: b0=0.2271, b1=0.4542, b2=0.2271
//             a1=-0.2767, a2=0.1851
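
The formulas in the comment block can be evaluated directly. The Python sketch below computes both sets of coefficients with the same Q = 0.7071 that initVoiceFilter() uses, then evaluates the gain of the HPF → LPF cascade at a few frequencies. `butterworth_biquad` and `gain_db` are hypothetical helper names.

```python
import cmath
import math

def butterworth_biquad(kind, fc, fs, q=0.7071):
    """Return (b0, b1, b2, a1, a2), normalized by a0, for a 2nd-order section."""
    w0 = 2 * math.pi * fc / fs
    alpha = math.sin(w0) / (2 * q)
    c = math.cos(w0)
    a0 = 1 + alpha
    if kind == "highpass":
        b0, b1 = (1 + c) / (2 * a0), -(1 + c) / a0
    elif kind == "lowpass":
        b0, b1 = (1 - c) / (2 * a0), (1 - c) / a0
    else:
        raise ValueError("kind must be 'highpass' or 'lowpass'")
    return (b0, b1, b0, -2 * c / a0, (1 - alpha) / a0)

def gain_db(coeffs, f, fs):
    """Magnitude response in dB of one biquad at frequency f."""
    b0, b1, b2, a1, a2 = coeffs
    z1 = cmath.exp(-1j * 2 * math.pi * f / fs)   # z^-1 evaluated on the unit circle
    h = (b0 + b1 * z1 + b2 * z1 * z1) / (1 + a1 * z1 + a2 * z1 * z1)
    return 20 * math.log10(abs(h))

FS = 16000
hpf = butterworth_biquad("highpass", 300, FS)
lpf = butterworth_biquad("lowpass", 3400, FS)

print("HPF:", [round(v, 4) for v in hpf])
print("LPF:", [round(v, 4) for v in lpf])
for f in (50, 300, 1000, 3400, 6000):
    total = gain_db(hpf, f, FS) + gain_db(lpf, f, FS)   # cascade: dB gains add
    print(f"{f:>5} Hz: {total:6.1f} dB")
```

Each section sits about 3 dB down at its own cutoff and rolls off at 12 dB per octave beyond it, so mains hum at 50 or 60 Hz is strongly attenuated while the 1 kHz region passes almost unchanged.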

Why 300–3400 Hz?
The human voice has fundamentals from about 85 Hz (male voices) to 255 Hz (female voices) and formants reaching up to 8 kHz. Intelligibility, however, is largely preserved within 300–3400 Hz, the standard telephone bandwidth (G.711). Whisper generally copes well with speech limited to this range.

Advantages of filtering on the ESP32:
→ Reduces background noise and EMI before sending
→ Can improve Whisper's speech recognition accuracy
→ Less irrelevant content in the WAV that is sent
→ More reliable VAD (voice activity detection)
