Raspberry Pi Voice Assistant: Local Speech Recognition and TTS Build

A Raspberry Pi voice assistant built in 2026 runs entirely offline using three open-source components: OpenWakeWord for wake word detection, Vosk for speech-to-text transcription, and Piper for text-to-speech output. Google Assistant SDK was shut down in June 2022 and no longer works. Amazon Alexa Voice Service requires a commercial agreement for production use. Snowboy, the wake word engine that appeared in older guides, is unmaintained and broken on Bookworm.

This guide covers the complete local stack on Pi 4 or Pi 5 from audio hardware setup through wake word detection, speech-to-text, intent dispatch, and Piper TTS output. No cloud account, API key, or internet connection required after setup. For the Piper TTS engine in detail, see Text to Speech Raspberry Pi with Piper: Setup and Engine Comparison.

Last tested: Raspberry Pi OS Bookworm Lite 64-bit | May 2026 | Raspberry Pi 4 Model B (4GB) | Python 3.11, OpenWakeWord 0.6, Vosk 0.3.45, Piper 1.2, sounddevice 0.4

Key Takeaways

  • Google Assistant SDK for devices was shut down in June 2022. Any guide that starts with pip install google-assistant-sdk or directs you to the Actions Console device registration is describing a dead product. The package no longer exists on PyPI in a working form. Do not follow those instructions.
  • The local stack (OpenWakeWord + Vosk + Piper) runs entirely on-device with no internet dependency after the initial model downloads. OpenWakeWord idle CPU usage on Pi 4 is approximately 3-8%, low enough to run alongside other services. Vosk transcription of a short command takes under 1 second on Pi 4 with the small English model.
  • A USB microphone is required. The Raspberry Pi 4’s 3.5mm jack is audio output only; it has no microphone input. The Pi 5 has no 3.5mm jack at all. A class-compliant USB microphone ($8-15) plugs in without drivers and appears immediately as an ALSA capture device.

Audio Hardware Setup for Raspberry Pi Voice Assistant

[Diagram: Raspberry Pi voice assistant local stack. USB audio input, OpenWakeWord, Vosk STT, intent dispatch, and Piper TTS output.]

Any class-compliant USB microphone works without configuration on Bookworm. Plug it in and verify it appears:

arecord -l

The output lists capture devices. A USB microphone typically appears as card 1 or card 2, depending on whether a USB DAC or HDMI audio device is also present. Note the card number; it is needed for the ALSA device string (hw:1,0 means card 1, device 0).
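
Once the Python packages from the wake word section are installed, the same check can be done from sounddevice, which also reveals the device index the pipeline script below uses (it is not always the same as the ALSA card number). A short sketch:

import sounddevice as sd

# List capture-capable devices with their sounddevice indices.
# DEVICE_INDEX in the pipeline script refers to this index,
# not the ALSA card number reported by arecord -l.
for idx, dev in enumerate(sd.query_devices()):
    if dev['max_input_channels'] > 0:
        print(f"{idx}: {dev['name']} ({dev['default_samplerate']:.0f} Hz)")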

Test that the microphone records clean audio:

# Record 5 seconds of audio:
arecord -D hw:1,0 -f S16_LE -r 16000 -c 1 -d 5 test.wav

# Play it back:
aplay test.wav

Expected result: The recording plays back clearly with your voice audible and no significant background noise. If only silence is recorded, the card number is wrong. If there is loud hiss, move the microphone away from the Pi’s switching power supply. The 16kHz sample rate and mono channel are the correct settings for both Vosk and OpenWakeWord.
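
For a programmatic version of the same check, a minimal sketch with sounddevice (installed in the next section) records a short clip and reports the peak level; a peak near zero indicates the wrong device, and a high floor with no speech suggests electrical noise:

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000
DEVICE = 1   # adjust to your microphone's sounddevice index

# Record 3 seconds, then report the peak sample value (int16 max is 32767)
audio = sd.rec(3 * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1,
               dtype='int16', device=DEVICE)
sd.wait()
print(f"Peak amplitude: {int(np.abs(audio.astype(np.int32)).max())} / 32767")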

For audio output, any speaker connected via the 3.5mm jack (Pi 4 only), HDMI, or USB works. Set a default output device in ~/.asoundrc if needed. For USB microphone + 3.5mm speaker on Pi 4, the default output is the 3.5mm jack and works without configuration. For Pi 5, use HDMI audio or a USB speaker. For audio setup details, see Raspberry Pi Audio: ALSA, PipeWire, DAC HATs, and Bluetooth Setup.

Wake Word Detection and Speech-to-Text

Create a virtual environment for all Python packages:

sudo apt update && sudo apt install -y python3-venv portaudio19-dev
python3 -m venv ~/voice-env
source ~/voice-env/bin/activate
pip install sounddevice numpy openwakeword vosk

Download the Vosk small English model (40MB, suitable for Pi 4):

mkdir -p ~/voice-models
cd ~/voice-models
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

OpenWakeWord ships with pre-trained models including hey_jarvis, alexa, and hey_mycroft; recent releases no longer bundle the model files in the pip package, so the script below fetches them once with openwakeword.utils.download_models(). The complete pipeline script listens for the wake word, then captures and transcribes a command:

import json
import queue

import sounddevice as sd

import openwakeword
from openwakeword.model import Model
from vosk import Model as VoskModel, KaldiRecognizer

SAMPLE_RATE = 16000
DEVICE_INDEX = 1        # sounddevice device index of the USB mic (see sd.query_devices())
CHUNK = 1280            # OpenWakeWord expects 80ms chunks at 16kHz

openwakeword.utils.download_models()    # fetches the pre-trained models if missing
oww_model = Model(wakeword_models=["hey_jarvis"])
vosk_model = VoskModel("vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(vosk_model, SAMPLE_RATE)

audio_q = queue.Queue()

def audio_callback(indata, frames, time, status):
    if status:
        print(status)               # surface overruns instead of failing silently
    audio_q.put(bytes(indata))

def listen_for_command(duration_s=5):
    """Record audio for up to duration_s seconds and return transcription."""
    while not audio_q.empty():      # discard stale audio from a previous command
        audio_q.get_nowait()
    with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                           device=DEVICE_INDEX, dtype='int16',
                           channels=1, callback=audio_callback):
        # 8000 samples per block at 16kHz = 0.5s per iteration
        for _ in range(int(SAMPLE_RATE / 8000 * duration_s)):
            data = audio_q.get()
            if recognizer.AcceptWaveform(data):
                break               # Vosk detected the end of the utterance
    result = json.loads(recognizer.Result())
    return result.get('text', '')

print("Listening for wake word 'hey Jarvis'...")
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    dtype='int16', device=DEVICE_INDEX,
                    blocksize=CHUNK) as stream:
    while True:
        audio_chunk, _ = stream.read(CHUNK)
        # OpenWakeWord expects raw int16 samples, not normalized floats
        audio_np = audio_chunk[:, 0]
        predictions = oww_model.predict(audio_np)
        # Prediction keys carry a version suffix (e.g. hey_jarvis_v0.1),
        # so match on the substring rather than an exact name
        if any(score > 0.5 for name, score in predictions.items()
               if 'hey_jarvis' in name):
            print("Wake word detected! Listening for command...")
            stream.stop()       # release the ALSA device for the command stream
            command = listen_for_command()
            stream.start()
            oww_model.reset()   # clear buffers so the same utterance doesn't retrigger
            print(f"Command: {command}")
            # Pass to intent dispatcher here

Expected result: The script prints “Wake word detected!” when “hey Jarvis” is spoken clearly within 30cm of the microphone. The transcribed command prints below it. Wake word sensitivity is controlled by the threshold (0.5 above). Lower values increase sensitivity but raise false positives. If the wake word triggers without speech, raise the threshold to 0.6 or 0.7.
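
If brief noise spikes still trigger detections at a threshold you are otherwise happy with, one common refinement (not part of the script above) is to require several consecutive chunks above the threshold before accepting a detection. A minimal sketch; the class name and defaults are illustrative:

class WakeWordDebouncer:
    """Accept a detection only after `needed` consecutive scores above threshold."""

    def __init__(self, threshold=0.6, needed=3):
        self.threshold = threshold
        self.needed = needed
        self.hits = 0

    def update(self, score):
        # Count consecutive above-threshold chunks; reset on any miss
        self.hits = self.hits + 1 if score > self.threshold else 0
        return self.hits >= self.needed

# Usage in the main loop, with debouncer = WakeWordDebouncer() created once:
#   score = max((s for n, s in predictions.items() if 'hey_jarvis' in n), default=0)
#   if debouncer.update(score): ...

Three consecutive 80ms chunks means roughly a quarter second of sustained detection, which cuts transient false positives without noticeably delaying response.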

Piper TTS Output and Intent Dispatch

Piper is the fastest local TTS engine for Raspberry Pi, producing natural-sounding speech in under 200ms on Pi 4. Install it from the APT repository on Bookworm (if your image does not carry the package, pip install piper-tts inside the venv works as well):

sudo apt install -y piper-tts

Download a voice model. The en_US-lessac-medium voice is a good balance of quality and speed on Pi 4:

mkdir -p ~/voice-models/piper
cd ~/voice-models/piper
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

Speak a response from Python:

import os
import subprocess

VOICE_MODEL = os.path.expanduser('~/voice-models/piper/en_US-lessac-medium.onnx')

def speak(text):
    """Convert text to speech via Piper and play through ALSA."""
    proc = subprocess.run(
        ['piper', '--model', VOICE_MODEL, '--output_raw'],
        input=text.encode(),
        capture_output=True
    )
    # The lessac medium voice outputs raw 22050Hz, 16-bit, mono audio
    subprocess.run(
        ['aplay', '-r', '22050', '-f', 'S16_LE', '-c', '1', '-'],
        input=proc.stdout
    )
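
To verify synthesis speed on your own hardware, wrap speak() in a quick timing harness; note the round trip includes aplay playback time, so the total will exceed the synthesis-only figure quoted above:

import time

start = time.monotonic()
speak("Voice assistant online.")
print(f"speak() round trip: {time.monotonic() - start:.2f}s")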

Intent dispatcher. Wire the transcribed command to actions using simple string matching. For a voice assistant that handles a handful of custom commands, keyword matching is more reliable than an NLP model and uses negligible CPU. Vosk returns lowercase text with no punctuation, so match against lowercase keywords:

from datetime import datetime
from gpiozero import LED

light = LED(17)

def dispatch(command):
    if not command:
        speak("Sorry, I didn't catch that.")
        return
    if 'light on' in command or 'turn on' in command:
        light.on()
        speak("Light on.")
    elif 'light off' in command or 'turn off' in command:
        light.off()
        speak("Light off.")
    elif 'time' in command:
        now = datetime.now().strftime("%I:%M %p")
        speak(f"The time is {now}.")
    elif 'goodbye' in command or 'stop' in command:
        speak("Goodbye.")
        raise SystemExit
    else:
        speak(f"I heard: {command}. I don't know that command yet.")

Expected result: Saying “hey Jarvis, turn on the light” lights the LED on GPIO17 and Piper responds “Light on.” through the speaker within 1-2 seconds of the command. If the latency is higher, check that the Vosk small model (not the large model) is in use. On Pi 4 with the small model, Vosk transcription of a 3-second utterance completes in under 800ms.

For Home Assistant integration, replace the GPIO dispatch with HTTP calls to the Home Assistant REST API. Each command calls http://[ha-ip]:8123/api/services/[domain]/[service] with a Bearer token (use https only if you have configured SSL on your HA instance). This lets the voice assistant control any HA entity without running the full HA stack on the same Pi. For the Home Assistant Raspberry Pi setup, see Home Assistant Raspberry Pi 5: Complete Supervised Install with NVMe Guide.
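
A minimal sketch of that dispatch path, assuming the requests library (pip install requests in the venv) and a long-lived access token created under your Home Assistant user profile; the address and entity IDs are placeholders:

import requests

HA_URL = "http://192.168.1.50:8123"        # placeholder: your HA address
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"  # placeholder: created in the HA profile page

def ha_call(domain, service, entity_id):
    """Call an HA service, e.g. ha_call('light', 'turn_on', 'light.desk_lamp')."""
    resp = requests.post(
        f"{HA_URL}/api/services/{domain}/{service}",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=5,
    )
    resp.raise_for_status()

In dispatch(), replace light.on() with ha_call('light', 'turn_on', 'light.desk_lamp') and the rest of the pipeline stays unchanged.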

Running the Raspberry Pi Voice Assistant at Boot

Create a systemd user service so the voice assistant starts automatically:

mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/voice-assistant.service << 'EOF'
[Unit]
Description=Raspberry Pi Voice Assistant
After=sound.target

[Service]
ExecStart=/home/youruser/voice-env/bin/python3 /home/youruser/voice_assistant.py
WorkingDirectory=/home/youruser
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
EOF

systemctl --user enable voice-assistant
systemctl --user start voice-assistant
loginctl enable-linger youruser  # Start at boot without login

Expected result: After reboot, the voice assistant starts automatically within 30 seconds of boot. Check status with systemctl --user status voice-assistant and view logs with journalctl --user -u voice-assistant -f. Replace youruser with the username set at flash time.

FAQ

Can I still use Google Assistant or Alexa on Raspberry Pi?

Google Assistant SDK for devices was shut down in June 2022. The PyPI package no longer works and device registration through the Actions Console is no longer available. Amazon Alexa Voice Service for custom devices requires an AVS commercial agreement for production deployment and the developer path has been effectively abandoned for hobbyist use. Both are dead ends for new Pi voice assistant projects in 2026. The fully local stack described in this article (OpenWakeWord + Vosk + Piper) is the correct replacement.

Which Raspberry Pi model is best for a voice assistant?

Pi 4 (4GB) is the recommended minimum for the full local stack. Vosk small model transcription, OpenWakeWord inference, and Piper TTS together use approximately 300-400MB RAM at runtime and 15-25% CPU during active speech processing. Pi 4 (2GB) is workable but leaves less headroom for other services. Pi 5 is faster for transcription (Vosk large model becomes practical) and handles the Whisper.cpp tiny model at acceptable speed. Pi Zero 2W is too slow for comfortable real-time transcription with Vosk and is not recommended.

What is the best wake word engine for Raspberry Pi?

OpenWakeWord is the current recommendation for offline wake word detection on Raspberry Pi. It ships with pre-trained models for several common wake words and uses TFLite for efficient inference. Idle CPU usage is approximately 3-8% on Pi 4. The older Snowboy project is unmaintained and broken on Bookworm. Porcupine (by Picovoice) is a well-maintained commercial alternative with a free tier that allows one wake word. It offers better accuracy than OpenWakeWord at the cost of requiring a free API key registration.

Can Raspberry Pi voice assistant work without internet?

Yes, completely. OpenWakeWord, Vosk, and Piper all run on-device with no network dependency after the initial model file downloads. The stack described in this article processes audio locally from wake word detection through speech transcription and TTS response. Internet connectivity is only needed if the intent dispatcher calls an external API (weather, Home Assistant on a different network, etc.). The core voice pipeline works on an air-gapped Pi.

How do I improve speech recognition accuracy on Raspberry Pi?

Position the USB microphone 15-30cm from where speech originates and away from the Pi’s power supply, which emits electrical noise. Record a test clip with arecord and play it back to assess quality before building the full stack. For the Vosk model, the medium English model (130MB) is noticeably more accurate than the small model (40MB) at the cost of approximately 2x transcription time. On Pi 5, the large model (1.8GB) is usable for a home assistant use case where sub-2-second latency is acceptable. Whisper.cpp tiny model is an alternative that trades accuracy for speed and works well on Pi 5.

About the Author

Chuck Wilson has been programming and building with computers since the Tandy 1000 era. His professional background includes CAD drafting, manufacturing line programming, and custom computer design. He runs PidiyLab in retirement, documenting Raspberry Pi and homelab projects that he actually deploys and maintains on real hardware. Every article on this site reflects hands-on testing on specific hardware and OS versions, not theoretical walkthroughs.

Last tested hardware: Raspberry Pi 4 Model B (4GB). Last tested OS: Raspberry Pi OS Bookworm Lite 64-bit. OpenWakeWord 0.6, Vosk 0.3.45, Piper 1.2, May 2026.