llama.cpp on Raspberry Pi 5: Local LLM Setup Guide

Last tested: Raspberry Pi OS 64-bit | April 09, 2026 | Raspberry Pi 5 (8GB)

Running llama.cpp on a Raspberry Pi 5 puts a quantized language model entirely on your own hardware: no cloud account, no API key, and no data leaving the room. I set this up last week and it works. The Pi 5’s quad-core Cortex-A76 CPU is fast enough to generate 5 to 10 tokens per second with a quantized 7B model, which is slow compared to a GPU server but perfectly usable for questions, drafting, and code help. This guide covers building llama.cpp from source, downloading a GGUF model, and running your first prompt.

Key Takeaways

  • llama.cpp runs on Raspberry Pi 5 with a quantized GGUF model. Expect 5 to 10 tokens per second on a 7B Q4 model.
  • 8GB Pi 5 is strongly preferred. 4GB works with aggressive swap and smaller models but will feel cramped.
  • Active cooling is required. llama.cpp runs the CPU hard and sustained thermal throttling kills token generation speed.
  • Boot from USB SSD rather than microSD. Model files are several gigabytes and load times on microSD are painful.
  • The build commands changed in recent llama.cpp versions. The binary is now llama-cli, not main.
  • Your prompts and responses stay entirely on your hardware. Nothing goes to an external server.
[Diagram: the llama.cpp software stack on a Raspberry Pi 5]

Hardware Requirements

Component    | Minimum                 | Recommended
RAM          | 4GB LPDDR4X             | 8GB LPDDR4X
Cooling      | Heatsink                | Active fan case
Storage      | 32GB microSD (A2 rated) | 64GB+ USB 3.0 SSD
Power supply | 5V/3A USB-C             | Official Pi 5 5V/5A adapter

The 4GB model is technically sufficient for a Q4 quantized 7B model, but it depends heavily on swap. On 8GB you have real headroom for context, swap stays quiet, and you can run slightly larger models. Active cooling matters more here than for most Pi projects. llama.cpp runs all four cores at high utilization during token generation and the board will throttle without airflow. A closed plastic case is not suitable for this workload.

For storage, boot from USB SSD if you can. Model files run 4 to 8GB each and loading a 7B model from a microSD card takes noticeably longer than from SSD. It also puts sustained read load on the card. See Booting Raspberry Pi from USB SSD for the setup. If staying on microSD, see Preventing SD Card Corruption on Raspberry Pi.

Preparing Raspberry Pi OS

Update the system

Start with Raspberry Pi OS 64-bit. Update fully before installing anything:

sudo apt update && sudo apt full-upgrade -y
sudo reboot

Install build dependencies

sudo apt install git cmake g++ build-essential -y

Expand swap space

The default 100MB swap is not enough. A 7B model under load can push RAM usage above 3.5GB on a 4GB Pi. Set swap to at least 2GB, or 4GB if you plan to experiment with larger models:

sudo nano /etc/dphys-swapfile

Find CONF_SWAPSIZE=100 and change it to:

CONF_SWAPSIZE=2048

Apply the change:

sudo systemctl restart dphys-swapfile
free -h   # confirm swap shows the new size

Heavy swap use on microSD shortens card life. For a build that runs llama.cpp regularly, this is another reason to use USB SSD for storage. See also Setting Up zram on Raspberry Pi for reducing write pressure from swap.
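
If you want to try zram without leaving this guide, the sketch below uses the zram-tools package; the 50 percent sizing is an assumption to tune for your workload, and the full write-up is in the linked article:

sudo apt install zram-tools -y
sudo nano /etc/default/zramswap   # assumption: uncomment and set PERCENT=50 for a zram device sized at half of RAM
sudo systemctl restart zramswap
swapon --show   # the zram device should be listed alongside the file-based swap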

Building llama.cpp from Source

Clone the repository

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Build with CMake

Recent versions of llama.cpp are configured and built with CMake directly. Do not use the mkdir/cd/cmake/make sequence from older guides; it no longer matches the current build system:

cmake -B build
cmake --build build --config Release -j4

The -j4 flag uses all four cores for compilation. The build takes 5 to 10 minutes on a Pi 5. You will see progress output like [ 12%] Building CXX object ... throughout. If you see errors about missing packages, go back and confirm the build dependencies installed correctly.

Confirm the build succeeded

The main inference binary is now called llama-cli, not main as older guides show. Confirm it exists and runs:

./build/bin/llama-cli --help

A usage message confirms the build is working. If you get “command not found” or “no such file,” the build did not complete successfully. Check the output from the cmake build step for errors.

Choosing and Downloading a GGUF Model

What GGUF means

GGUF (GGML Universal File) is a binary model format designed for llama.cpp and compatible tools. It packages the model weights, tokenizer, and metadata into a single file that the inference engine can map into memory efficiently. You cannot use arbitrary model files from HuggingFace: they need to be in GGUF format, and they need to be quantized small enough to fit on a Pi.

Model sizes and what the Pi 5 can handle

Model           | Quantization | File size | RAM needed | Pi 5 4GB  | Pi 5 8GB
TinyLlama 1.1B  | Q4_K_M       | ~0.7GB    | ~1GB       | Yes       | Yes
Phi-3 Mini 3.8B | Q4_K_M       | ~2.3GB    | ~3GB       | Yes       | Yes
Llama 3.2 3B    | Q4_K_M       | ~2GB      | ~2.5GB     | Yes       | Yes
Mistral 7B      | Q4_K_M       | ~4.1GB    | ~5GB       | With swap | Yes
Llama 3.1 8B    | Q4_K_M       | ~4.7GB    | ~6GB       | Tight     | Yes

Q4_K_M is generally the best quantization level to start with. It gives good quality while keeping RAM use manageable. Q4_0 uses slightly less memory but produces noticeably weaker output. Q5_K_M is worth trying on 8GB for better quality at the cost of more RAM.

Where to find GGUF models

HuggingFace is the main source. Reliable uploaders of well-tested GGUF quantizations include bartowski, unsloth, and the official model pages from Meta, Mistral, Microsoft, and Google. TheBloke was the original go-to source but stopped updating in early 2024. Search for your model name plus “GGUF” on HuggingFace to find current versions.

Download the model

Create a directory for models and download with wget. This example uses Llama 3.2 3B from a HuggingFace GGUF repository:

mkdir -p ~/llama-models
cd ~/llama-models

# Replace this URL with the actual GGUF file URL from HuggingFace
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

Find the direct file URL on HuggingFace by clicking the model file, then clicking the download icon and copying the link. Paste that URL into the wget command above.
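
If you would rather not copy URLs by hand, the Hugging Face CLI can pull a single file from a repository. This is an optional alternative to wget, not something llama.cpp requires; the repository and filename below are the same ones used above, and on Raspberry Pi OS Bookworm you may need a virtual environment or pipx instead of a bare pip install:

pip install -U "huggingface_hub[cli]"

huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --local-dir ~/llama-models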

Running Your First Prompt

Basic inference command

From inside the llama.cpp directory, run a prompt directly:

./build/bin/llama-cli \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "What is the capital of France?" \
  -n 100 \
  -t 4

Flag breakdown: -m is the model path, -p is the prompt string, -n is the maximum number of tokens to generate, and -t is the number of CPU threads to use. Use -t 4 on Pi 5 to engage all four cores.

You will see the model load, then tokens appear one by one. At the end, timing stats show tokens per second:

llama_print_timings: eval time = 12345 ms / 87 tokens (  141.90 ms per token,   7.05 tokens per second)

Using a prompt file

For longer prompts, use a file rather than the -p flag:

nano ~/prompt.txt
# Write your prompt, save with Ctrl+O, exit with Ctrl+X

./build/bin/llama-cli \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -f ~/prompt.txt \
  -n 150 \
  -t 4

Interactive chat mode

For back-and-forth conversation, use the -i flag with a system prompt:

./build/bin/llama-cli \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -i \
  --system-prompt "You are a helpful assistant." \
  -n 200 \
  -t 4

Useful optional flags

  • --temp 0.7 controls randomness. Lower values give more predictable output; 0.0 is deterministic.
  • --ctx-size 512 limits the context window size to reduce RAM use on constrained hardware.
  • --repeat-penalty 1.1 reduces repetitive output.
  • --top-k 40 narrows the token sampling range.
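
Combining several of these on a memory-constrained Pi might look like the following; the values are starting points taken from the list above, not tuned recommendations:

./build/bin/llama-cli \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p "Explain what swap space is in two sentences." \
  -n 150 \
  -t 4 \
  --temp 0.7 \
  --ctx-size 512 \
  --repeat-penalty 1.1 \
  --top-k 40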

Performance Tuning

Thread count and tokens per second

Threads | Avg tokens/sec (Q4_K_M, 3B) | Avg tokens/sec (Q4_K_M, 7B)
1       | ~4                          | ~2.5
2       | ~7                          | ~4
4       | ~12                         | ~7.5

Use -t 4 for maximum throughput. The Pi 5 has four performance cores and llama.cpp scales well across them. Running fewer threads reduces heat and power draw but slows generation noticeably.
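
To measure throughput on your own hardware rather than relying on the table above, the build also produces a llama-bench binary. A minimal run might look like this; the prompt and generation lengths are arbitrary choices:

./build/bin/llama-bench \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  -p 128 \
  -n 64 \
  -t 4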

Monitor temperature and throttling

# Check temperature
vcgencmd measure_temp

# Watch temperature during a run
watch -n 1 vcgencmd measure_temp

# Check for throttle events after a session
vcgencmd get_throttled

A non-zero value from get_throttled means the board throttled at some point. Sustained throttling above 80 degrees C will drop tokens per second noticeably. If this is happening regularly, improve cooling before adjusting any software settings. See Raspberry Pi Randomly Reboots Under Load for a full breakdown of throttle flag values.

Quantization level trade-offs

  • Q4_0: smallest file, fastest load, weakest output quality
  • Q4_K_M: good balance of size, speed, and quality. Best starting point.
  • Q5_K_M: noticeably better quality, ~20 percent more RAM. Worth trying on 8GB Pi.
  • Q8_0: near full-precision quality, roughly double the RAM of Q4. Not practical on 4GB.

Running a Local API Server

llama.cpp includes a server binary that exposes an OpenAI-compatible REST API. This lets you send prompts from other devices on the LAN, integrate with scripts, or use compatible frontends:

./build/bin/llama-server \
  -m ~/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -t 4 \
  -c 512

Then send a prompt from any device on the same LAN:

curl http://YOUR_PI_IP:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Keep this on your LAN only. The server has no authentication by default. Do not expose port 8080 to the internet.
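
If you want to be explicit about which machines can reach the server, a firewall rule scoped to your local subnet is one approach. This sketch assumes ufw and a 192.168.1.0/24 LAN; adjust the subnet to match your network, and keep SSH allowed before enabling the firewall:

sudo apt install ufw -y
sudo ufw allow ssh
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp
sudo ufw enable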

Troubleshooting

CMake or g++ not found

sudo apt install git cmake g++ build-essential -y

System freezes or goes unresponsive during model load

The Pi ran out of RAM and has no swap, or swap is exhausted. Check:

free -h

If swap still shows 100MB, the new size has not taken effect. Run sudo systemctl restart dphys-swapfile and confirm with free -h again. If RAM and swap are both exhausted, use a smaller model or reduce the context size with --ctx-size 512.

llama-cli not found after build

The binary lives at ./build/bin/llama-cli, not ./main; newer releases renamed the main binary to llama-cli. If you built from an older guide, clean and rebuild:

rm -rf build
cmake -B build
cmake --build build --config Release -j4

Model file not found error

ls ~/llama-models/

Confirm the filename matches exactly what you pass to -m. GGUF filenames are long and easy to mistype. Use tab completion when entering the path.

Segmentation fault on model load

Almost always a memory issue. Try in order: reduce context size with --ctx-size 512, switch to a smaller model, increase swap to 4096MB. If loading from microSD, move the model to USB SSD. Sustained reads from a degraded SD card can also cause this.

Permission denied on binary

chmod +x ./build/bin/llama-cli

What You Can Use This For

A 7B or smaller model running locally on Pi 5 is useful for specific, bounded tasks. It is not a replacement for a large hosted model. It works well for drafting short text, answering factual questions from its training data, explaining code logic, summarizing short documents, and simple back-and-forth conversation. It handles these tasks with no internet connection, no account, and no record kept anywhere outside the Pi.

It does not handle long reasoning chains, current events, or large-context tasks as well as hosted models. The token rate of 5 to 10 per second means a 200-token response takes 20 to 40 seconds. For tasks where that latency is acceptable and privacy matters, a locally hosted model on a Pi 5 is a practical tool rather than a novelty.

FAQ

Can llama.cpp run on a Raspberry Pi 4?

Yes, but performance is noticeably worse. The Pi 4’s Cortex-A72 cores are slower than the Pi 5’s Cortex-A76, and the Pi 4 lacks the Pi 5’s improved memory bandwidth. Expect 2 to 4 tokens per second on a Q4 7B model. Small models like TinyLlama or Phi-3 Mini are more practical on Pi 4.

Does llama.cpp use the Pi 5 GPU?

Not in the standard build. The VideoCore VII GPU in the Pi 5 does not have a supported GPU backend in llama.cpp at the time of writing. All inference runs on the CPU. GPU acceleration for Pi would require a Vulkan or OpenCL backend that matches the VideoCore VII, which is not yet production-ready in llama.cpp for this hardware.

Do I need internet access after setup?

No. Once you have built llama.cpp and downloaded your model, the whole system runs fully offline. The only reason to reconnect is to update llama.cpp, download a different model, or update the OS.

What is the best model for a Pi 5 4GB?

Phi-3 Mini 3.8B Q4_K_M or Llama 3.2 3B Q4_K_M. Both load comfortably within 4GB RAM without touching swap significantly, run at reasonable token rates, and produce useful output. TinyLlama 1.1B is faster but noticeably weaker in quality. Mistral 7B works on 4GB but requires swap and slows down under sustained use.

How do I update llama.cpp?

Pull the latest commits and rebuild:

cd llama.cpp
git pull
rm -rf build
cmake -B build
cmake --build build --config Release -j4

llama.cpp updates frequently. The API and binary names have changed across versions. Rebuilding from a clean state after a pull avoids most compatibility problems.

Can I run the server permanently as a service?

Yes. Create a systemd service file pointing to the llama-server binary with your chosen model and flags. Enable it with sudo systemctl enable. The server will start at boot and restart if it crashes. Keep in mind that having a large model loaded in memory continuously will consume most of the Pi’s RAM even when idle.
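
A minimal unit file might look like the sketch below. The paths, user, and flags are assumptions based on the examples earlier in this guide; adjust them, save the file as /etc/systemd/system/llama-server.service, then run sudo systemctl daemon-reload followed by sudo systemctl enable --now llama-server.

[Unit]
Description=llama.cpp local API server
After=network-online.target
Wants=network-online.target

[Service]
# Assumption: repo cloned to /home/pi/llama.cpp and the model stored in /home/pi/llama-models
ExecStart=/home/pi/llama.cpp/build/bin/llama-server \
  -m /home/pi/llama-models/Llama-3.2-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -t 4 -c 512
User=pi
Restart=on-failure

[Install]
WantedBy=multi-user.target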

About the Author

Chuck Wilson has been programming and building with computers since the Tandy 1000 era. His professional background includes CAD drafting, manufacturing line programming, and custom computer design. He runs PidiyLab in retirement, documenting Raspberry Pi and homelab projects that he actually deploys and maintains on real hardware. Every article on this site reflects hands-on testing on specific hardware and OS versions, not theoretical walkthroughs.

Last tested hardware: Raspberry Pi 5 (8GB). Last tested OS: Raspberry Pi OS 64-bit.
