Running a 70B LLM on Pure RISC-V: The MilkV Pioneer Deployment Journey
When the 40GB download completed and the model loaded into memory, I wondered: would a 70-billion parameter language model actually run on a RISC-V CPU? No GPU. No CUDA. Just 64 cores and a lot of RAM. Spoiler: it worked. Here’s the complete technical journey of deploying llama.cpp and Ollama on the MilkV Pioneer, including build challenges, architectural insights, and practical lessons for running state-of-the-art LLMs on CPU-only RISC-V hardware.
Why This Matters: LLMs Meet RISC-V
RISC-V is the underdog processor architecture. While x86 and ARM dominate the computing landscape with decades of software optimization and hardware acceleration, RISC-V is the open-source upstart. It’s the architecture you’ve probably never heard of unless you follow hardware closely. And running modern AI workloads on RISC-V? That’s even more niche.
But here’s the thing: RISC-V represents something important. It’s an open ISA (Instruction Set Architecture) that anyone can implement without licensing fees. It’s the Linux of processor architectures. And as AI becomes increasingly central to computing, the question isn’t just “Can we run LLMs on RISC-V?” but rather “How do we democratize AI hardware?”
I got my hands on a MilkV Pioneer - a 64-core RISC-V workstation - and decided to find out if it could run serious language models. Not 7B parameter toy models, but the big ones. The 70-billion parameter models that typically require expensive NVIDIA GPUs.
The equation seemed impossible at first: a 70B parameter model in FP16 precision requires roughly 140GB of memory. Our system has 125GB of RAM. Yet with quantization and smart memory management, it worked. This article documents the entire journey: the hardware choices, build challenges, surprising insights, and step-by-step instructions for anyone wanting to replicate this.
⚠️ Important: Scope and Limitations: This article documents deployment feasibility and build processes for running LLMs on RISC-V hardware. Quantitative performance benchmarks (tokens per second, latency, throughput) were not collected during this deployment. Performance estimates mentioned in this article (such as “1-5 tokens/sec”) are theoretical extrapolations based on hardware characteristics, not measured data.
Readers planning production deployments should conduct their own benchmarking with llama-bench or similar tools to measure actual performance on their specific hardware and workloads.
Hardware Foundation: The MilkV Pioneer
Before diving into software, let’s talk about the hardware that made this possible. The MilkV Pioneer isn’t your typical development board. It’s a serious workstation-class machine designed around RISC-V processors.
System Specifications
Here’s what we’re working with:
| Component | Specification |
|---|---|
| CPU | 64-core RISC-V processor (vendor ID: 0x5b7) |
| ISA | rv64imafdcv |
| Memory | 125GB RAM + 8GB swap (122GB available after OS overhead) |
| Storage | 939GB NVMe SSD (119GB used, 820GB available) |
| OS | Fedora Linux 38 Workstation (riscv64) |
| GPU | None - pure CPU inference only |
Let me unpack that ISA specification because it’s crucial to understanding why this deployment worked.
Decoding rv64imafdcv: The Secret Sauce
The ISA string rv64imafdcv is more than just alphabet soup. Each letter represents a set of processor capabilities:
| Extension | Full Name | Significance for LLM Inference |
|---|---|---|
| rv64 | 64-bit base integer ISA | Enables addressing the full 125GB memory space for large models |
| i | Integer operations | Base integer arithmetic operations |
| m | Multiply/Divide | Critical for matrix operations in transformer architectures |
| a | Atomic instructions | Thread synchronization for parallel inference across 64 cores |
| f | Single-precision floating-point | FP32 operations for model weights and computations |
| d | Double-precision floating-point | FP64 for high-precision accumulation where needed |
| c | Compressed instructions | 16-bit instruction encoding (reduces code size) |
| v | Vector extensions | CRITICAL: SIMD operations for tensor mathematics |
The v extension is the game-changer here. Vector extensions provide SIMD (Single Instruction, Multiple Data) capabilities similar to x86 AVX or ARM NEON. Without these vector extensions, CPU-only inference would be 5-10x slower. With them, we can efficiently parallelize the matrix multiplications that dominate transformer model inference.
This is why llama.cpp’s GGML_CPU_AARCH64=ON configuration works on RISC-V: the vector extensions are architecturally similar enough to ARM NEON that the same optimized code paths function correctly. Cross-architecture code reuse for the win.
The Strategic Hardware Choice: Memory First
You might notice something unusual about this system: 125GB of RAM is massive for a workstation. Why so much?
The answer reveals our core strategy: strategic compensation. When you don’t have GPU acceleration, you compensate with other resources:
- No GPU memory? → Use abundant system RAM instead
- Slower per-core inference? → Throw 64 cores at the problem
- Limited software ecosystem? → Choose portable, well-designed software
This memory-first approach enables loading entire large models into RAM without swapping to disk. A 70B parameter model quantized to Q4_0 format requires roughly 40GB of memory. With 125GB available, we have comfortable headroom for:
- The model itself (~40GB)
- Operating system and services (~3GB)
- Inference state and KV cache (~10-20GB)
- Multiple concurrent sessions if needed
The result: no disk swapping, no memory pressure, and the ability to load models that would choke smaller systems.
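If you want to sanity-check those numbers, a back-of-envelope calculation gets close enough. The bits-per-weight figures below are approximations (Q4_0 works out to roughly 4.5 bits per weight once block scales are included, Q8_0 to roughly 8.5), so treat the output as ballpark sizing rather than exact file sizes:
# Rough size estimates for a 70B-parameter model at different precisions
awk 'BEGIN {
  p = 70e9                                        # parameter count
  printf "FP16 : %.0f GB\n", p * 16  / 8 / 1e9
  printf "Q8_0 : %.0f GB\n", p * 8.5 / 8 / 1e9    # ~8.5 bits/weight incl. block scales
  printf "Q4_0 : %.0f GB\n", p * 4.5 / 8 / 1e9    # ~4.5 bits/weight incl. block scales
}'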
Journey Part 1: Building llama.cpp
With hardware in place, the real challenge began: getting cutting-edge LLM software to build on an architecture it was never explicitly designed for.
Toolchain Exploration: Finding Solid Ground
My first instinct was to use the latest compiler. GCC 14.2.0 source build? Sure, bleeding-edge features sound great. I also explored the XUANTIE RISC-V GNU Toolchain from T-Head (Alibaba’s RISC-V division), thinking vendor-specific optimizations might help.
But sometimes newer isn’t better, especially on emerging architectures. After experimentation, I settled on the system-provided GCC 13.2.1 from Red Hat. Why?
- Proven stability: Red Hat’s testing means fewer surprises
- Ecosystem integration: Works seamlessly with Fedora’s packages
- ccache support: Dramatically speeds incremental builds
- Good enough: Compiler optimization differences rarely matter as much as architecture-specific code paths
The lesson: on cutting-edge hardware, conservative toolchain choices often win.
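If you follow the same route, it's worth confirming that ccache is actually in play before kicking off large builds. These are standard ccache commands; the wrapper directory shown is where Fedora installs it and may differ on other distributions:
# Check whether the compiler being picked up is the ccache wrapper (Fedora: /usr/lib64/ccache)
which cc gcc
# Zero the cache statistics before the first build, then check the hit rate after a rebuild
ccache -z
ccache -s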
The Build: Surprisingly Straightforward
llama.cpp is designed for portability, and it shows. The build process was refreshingly simple:
# Clone the repository
cd ~/
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Configure with CMake
cmake -B build
# Build with all 64 cores
cmake --build build --config Release -j 64
That’s it. No special patches, no RISC-V-specific forks, no fighting with dependencies. Just standard CMake commands.
The build took advantage of all 64 cores (hence -j 64) and ccache reduced rebuild times significantly. Within minutes, we had 30+ executables in build/bin/:
$ ls -lh build/bin/ | grep llama
-rwxr-xr-x. 1 user user 1.4M Jan 28 10:23 llama-cli
-rwxr-xr-x. 1 user user 1.5M Jan 28 10:23 llama-server
-rwxr-xr-x. 1 user user 1007K Jan 28 10:23 llama-bench
-rwxr-xr-x. 1 user user 1.2M Jan 28 10:23 llama-quantize
-rwxr-xr-x. 1 user user 1.4M Jan 28 10:23 libllama.so
Configuration Deep-Dive: What Makes It Work
The real magic happens in the CMake configuration. Let’s look at the key settings from build/CMakeCache.txt:
# CPU backend configuration
GGML_CPU:BOOL=ON # CPU backend enabled
GGML_CPU_AARCH64:BOOL=ON # ← The critical setting
GGML_ACCELERATE:BOOL=ON # Generic CPU acceleration
GGML_CCACHE:BOOL=ON # Build caching for speed
# Explicitly disabled (hardware constraints)
GGML_CUDA:BOOL=OFF # No NVIDIA GPU
GGML_AVX:BOOL=OFF # x86-specific, not available
GGML_AVX2:BOOL=OFF # x86-specific, not available
GGML_AVX512:BOOL=OFF # x86-specific, not available
# Not used (simplicity)
GGML_BLAS:BOOL=OFF # No external BLAS library
# Build configuration
CMAKE_BUILD_TYPE:STRING=Release # Optimize for performance
CMAKE_C_COMPILER=/usr/lib64/ccache/cc # GCC 13.2.1 with ccache
The GGML_CPU_AARCH64=ON setting deserves special attention. This tells llama.cpp to use its ARM-optimized code paths on RISC-V. Why does this work?
RISC-V vector extensions (the v in rv64imafdcv) provide SIMD capabilities architecturally similar to ARM NEON. The compiler can map ARM intrinsics to RISC-V vector instructions, allowing code written for aarch64 to function correctly on RISC-V. This is architectural portability at its finest - well-designed abstractions enable cross-platform optimization.
Another interesting choice: GGML_BLAS=OFF. We’re not using an external BLAS (Basic Linear Algebra Subprograms) library. Why not?
Possible reasons:
- No optimized BLAS library readily available for RISC-V
- llama.cpp’s built-in GGML operations sufficient for performance
- Reducing dependencies simplifies the build and deployment
- RISC-V numerical library ecosystem not yet mature
I call this “strategic simplicity” - minimizing moving parts to ensure stability. When blazing new trails on emerging architectures, fewer dependencies mean fewer things that can break.
⚠️ Warning: The GGML_CPU_AARCH64=ON configuration works on RISC-V processors with vector extensions. If your RISC-V system lacks the v extension, you'll need different settings and should expect significantly slower performance.
Validation: Does It Actually Work?
Build success doesn’t mean functionality. Let’s test:
$ ./build/bin/llama-cli --version
llama-cli version b8595b16
$ ./build/bin/llama-bench --help
usage: llama-bench [options]
...
It runs. The binaries execute without errors, help text displays correctly, and we can move forward. Small victories matter when working with new architectures.
Journey Part 2: Deploying Ollama
llama.cpp proved that RISC-V could run LLM inference engines. But llama.cpp is developer-focused - you manage GGUF files manually, pass command-line arguments for every option, and handle model lifecycle yourself.
For actual usability, I wanted Ollama. Ollama wraps llama.cpp in a friendly interface with automatic model downloading, versioning, and an OpenAI-compatible HTTP API. It’s the difference between git and GitHub Desktop - same power, much better UX.
The First Build Attempt: A Lesson in Memory
Ollama is written in Go and wraps llama.cpp for the actual inference. My first build attempt used the same parallelism strategy that worked for llama.cpp:
cd ~/
git clone https://github.com/ollama/ollama.git
cd ollama
# Try building with all cores
make -j 64
This failed. Not with compiler errors, but with the system grinding to a halt. Memory usage spiked, swap thrashed, and the build eventually died.
What went wrong?
The Go compiler has fundamentally different memory characteristics than C/C++ compilers:
| Aspect | C/C++ (GCC) | Go |
|---|---|---|
| Compilation model | Per-file, independent | Whole-program optimization |
| Memory per job | ~500MB-1GB | ~2-4GB |
| 64 parallel jobs | ~32-64GB (fits in 125GB) | ~128-256GB (exceeds capacity) |
The math was simple: Go’s compiler maintains large in-memory representations during compilation. With 64 parallel jobs, we’d need 128-256GB of memory. Our system has 125GB. Recipe for disaster.
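The easy way to catch this before the system starts thrashing is to watch memory from a second terminal while the build runs, or to record a build's peak memory usage. Both commands below are standard Linux tooling (the second needs GNU time from the time package, not the shell builtin):
# In another terminal: watch the 'available' column and back off -j if it keeps shrinking
watch -n 5 free -h
# Record peak memory of a build attempt; look for "Maximum resident set size" in the report
/usr/bin/time -v make -j 5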
The Solution: Language-Specific Build Strategies
The fix was equally simple:
# Reduce parallelism for Go's memory requirements
make -j 5
With only 5 parallel jobs (~10-20GB memory), the build completed successfully:
$ ls -lh ollama
-rwxr-xr-x. 1 user user 27M Jan 28 11:45 ollama
$ ./ollama --version
ollama version is 0.5.7-6-g2ef3c80
Key lesson: On high-core-count systems, build parallelism isn’t just $(nproc). It depends on the language implementation:
- C/C++ projects: -j $(nproc) usually works
- Go projects: -j $(( $(nproc) / 10 )) or less to limit memory
- Rust projects: Somewhere in between (depends on project size)
This is especially relevant for RISC-V systems, which often have higher core counts to compensate for lower single-threaded performance.
💡 Tip: If you’re building Go projects on high-core-count machines, start with low parallelism and increase gradually while monitoring memory usage. The fastest build is the one that completes without thrashing.
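One way to pick a starting point is to size the job count from available memory rather than core count. The 4GB-per-job figure below is just the rough estimate from the table above, not a measured constant, so adjust it for your project:
# Derive Go build parallelism from available RAM instead of core count (rough heuristic)
AVAIL_GB=$(free -g | awk '/^Mem:/ {print $7}')   # 'available' column, in GB
JOBS=$(( AVAIL_GB / 4 ))                         # assume ~4GB per parallel Go compile job
[ "$JOBS" -lt 1 ] && JOBS=1
echo "Building with -j $JOBS"
make -j "$JOBS"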
Architecture Comparison: llama.cpp vs Ollama
Now that we have both systems built, let’s compare their trade-offs:
| Dimension | llama.cpp | Ollama |
|---|---|---|
| Implementation | C/C++ (performance-first) | Go (productivity-first) wrapping llama.cpp |
| Distribution | 30+ specialized binaries | Single unified binary (27MB) |
| Model management | Manual GGUF file handling | Automatic download and versioning |
| API | Low-level C API + CLI tools | HTTP API (OpenAI-compatible) |
| Build complexity | CMake, multiple targets | Makefile, single target (but higher memory) |
| Best for | Custom integration, embedded systems, maximum control | Rapid prototyping, user-friendly deployment, API services |
For this deployment, I wanted both: llama.cpp for understanding the low-level mechanics, and Ollama for actually using the models day-to-day.
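To make the Ollama side of that table concrete: once the server is running, everything goes through the HTTP API on port 11434. The prompt below is just an example:
# Start the Ollama server (listens on 127.0.0.1:11434 by default)
./ollama serve &
# Query the native HTTP API; "stream": false returns a single JSON object instead of a token stream
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain the RISC-V v extension in one sentence.",
  "stream": false
}'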
The 70B Milestone: Pushing the Limits
With both llama.cpp and Ollama successfully built, it was time to deploy actual models. I started small to validate the setup, then went big to prove the concept.
Warmup: llama3.2 Deployment
First, a sanity check with a smaller model:
$ ./ollama pull llama3.2
pulling manifest
pulling dde5aa3fc5ffc... 100% ▕████████████████▏ 1.9 GB
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕████████████████▏ 96 B
pulling 34bb5ab01051... 100% ▕████████████████▏ 561 B
success
$ ./ollama run llama3.2
>>> Hello! How are you?
Hello! I'm doing well, thank you for asking. I'm a large language model,
so I don't have feelings like humans do, but I'm functioning properly and
ready to help with any questions or tasks you might have. How can I assist
you today?
Success! A 2GB model downloads, loads, and responds correctly. The system works.
The Big One: deepseek-r1:70b
Now for the real test. DeepSeek-R1 is a 70-billion parameter model - one of the largest open-source language models available. The download alone is 40GB:
$ ./ollama pull deepseek-r1:70b
pulling manifest
pulling 4cd576ddf4bc... 100% ▕████████████████▏ 40 GB # The model itself
pulling 31ec11a09e5c... 100% ▕████████████████▏ 1.3 KB # Template
pulling 2be0dde01a1e... 100% ▕████████████████▏ 11 KB # License
pulling 2434009efafa... 100% ▕████████████████▏ 1.6 KB # Parameters
pulling 0f3aa73e5463... 100% ▕████████████████▏ 491 B # Metadata
success
The download completed. Total size: 42.5GB, comfortably within our 820GB available storage. But could it actually run?
$ ./ollama run deepseek-r1:70b
>>> Explain RISC-V vector extensions in technical terms.
<think>
The user wants a technical explanation of RISC-V vector extensions. I should
cover the key aspects: SIMD capabilities, variable-length vectors, register
organization, and how they compare to other architectures...
</think>
RISC-V vector extensions (RVV) provide scalable SIMD processing capabilities
through a variable-length vector architecture. Unlike fixed-width SIMD in x86
(AVX-512) or ARM (NEON), RVV uses the VLEN (vector length) parameter which can
be implementation-defined from 128 bits to effectively unlimited.
The extension introduces 32 vector registers (v0-v31) with configurable
effective length through the vtype CSR (control and status register). This
allows the same code to run efficiently on different hardware implementations...
[continues with detailed technical explanation]
It works.
A 70-billion parameter language model is running on a RISC-V CPU. No GPU. No CUDA. No specialized AI accelerators. Just 64 cores, 125GB of RAM, and well-designed software.
Why 70B on CPU-Only ACTUALLY Works
Let’s break down the technical reality of running a 70B model on CPU:
Memory Requirements:
- 70B parameters in FP16: ~140GB (too large)
- 70B parameters in Q4_0 quantization: ~40GB (fits comfortably)
- KV cache for inference context: ~10-20GB
- OS and services overhead: ~3GB
- Total: ~53-63GB out of 125GB available
Performance Trade-offs:
- GPU inference (A100): 70B @ 20-40 tokens/sec (estimated)
- This system: 70B @ 1-5 tokens/sec (estimated based on hardware)
- Latency ratio: 5-10x slower than high-end GPU
- Cost ratio: $0 GPU cost vs $10,000+ for A100
The Strategic Win:
For research, development, experimentation, and batch processing, slower inference is acceptable. We’re not serving production API requests at scale - we’re exploring what’s possible, testing prompts, and validating that RISC-V can handle serious AI workloads.
The 125GB RAM strategy eliminates the GPU memory bottleneck entirely. There’s no swapping, no OOM kills, no careful memory management. The entire model fits in RAM with room to spare.
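You can verify the no-swapping claim on your own system while a model is generating: if the si and so columns in vmstat stay at zero, the model really is being served entirely from RAM:
# Watch swap-in (si) and swap-out (so) once per second during inference
vmstat 1
# Cross-check overall memory and swap usage
free -h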
⚠️ Important: These performance estimates are theoretical, derived from hardware characteristics rather than measurements. Actual benchmarking with llama-bench would provide quantitative validation, which is planned as future work.
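When you do benchmark, an invocation along these lines is a reasonable starting point; the model path is illustrative, so point it at whatever GGUF file you have on disk:
# Example llama-bench run: -t threads, -p prompt tokens to test, -n tokens to generate
./build/bin/llama-bench -m /path/to/model-q4_0.gguf -t 64 -p 512 -n 128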
Challenges Deep-Dive: Lessons from the Trenches
Success stories gloss over the messy parts. Let’s talk about what made this challenging and what I learned.
Challenge 1: Ecosystem Maturity Gaps
RISC-V isn’t x86. The software ecosystem is younger, and it shows in unexpected places.
What’s missing:
- Pre-built packages: Many tools require building from source
- Optimized libraries: No mature BLAS/LAPACK implementations
- Container infrastructure: Docker/containerd need source builds
- CI/CD tooling: Jenkins required manual Java installation
I ended up building Docker’s entire stack from source: runc, crun, containerd, dockerd, plus dependencies like libseccomp. This took hours and involved wrangling multiple Go projects.
Similarly, Jenkins needed Temurin JDK 21 manually installed to /opt with profile scripts:
sudo tar -xzf temurin.tgz -C /opt
sudo sh -c 'echo "export JAVA_HOME=/opt/jdk-21.0.5+11" >> /etc/profile.d/jdk21.sh'
sudo sh -c 'echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile.d/jdk21.sh'
The bright side:
Not everything requires source builds. Core development tools (GCC, Git, CMake, Go) are packaged in Fedora. The ecosystem is improving rapidly.
Lesson: RISC-V is transitioning from “pioneering” to “early mainstream.” Application development works well, but infrastructure tooling lags. Budget extra time for DevOps setup.
Challenge 2: Memory-Centric Performance Thinking
Without GPU acceleration, performance optimization requires different mental models.
Traditional GPU-based LLM deployment:
- Load model into GPU VRAM (limited, expensive)
- Transfer input over PCIe to GPU
- Compute on GPU (fast)
- Transfer output over PCIe to CPU
- Bottleneck: PCIe bandwidth and VRAM capacity
CPU-only RISC-V deployment:
- Load model into system RAM (abundant, cheap)
- Compute on CPU cores (slower per-operation)
- Data stays in RAM (no transfers)
- Bottleneck: CPU compute throughput and memory bandwidth
This shifts optimization strategies:
- GPU optimization: Minimize data transfer, maximize VRAM usage, batch aggressively
- CPU optimization: Maximize core utilization, optimize memory access patterns, leverage caching
Opportunities I haven’t fully explored:
- NUMA awareness: A 64-core system is likely multi-socket. NUMA-aware memory allocation could improve performance (see the sketch after this list)
- Memory-mapped models: Lazy-loading model weights from disk instead of full load
- Multi-model sharing: Load model once, serve multiple concurrent sessions
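For the NUMA item, a concrete first experiment (untested on this machine, and only meaningful if the board actually exposes multiple NUMA nodes) would look like this:
# Inspect the NUMA topology first
numactl --hardware
# If there are multiple nodes, try interleaving the server's memory across them
numactl --interleave=all ./ollama serve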
Challenge 3: Build Parallelism Is Language-Dependent
I mentioned this earlier, but it’s worth emphasizing: the Go compiler memory issue caught me off-guard.
On x86 systems with 8-16 cores, you rarely hit Go’s memory limits. But on a 64-core RISC-V system, suddenly compiler memory characteristics matter enormously.
Rule of thumb for high-core systems:
# C/C++ projects (low memory per job)
make -j $(nproc)
# Rust projects (medium memory per job)
make -j $(( $(nproc) / 2 ))
# Go projects (high memory per job)
make -j $(( $(nproc) / 10 ))
Monitor with htop during builds to tune parallelism for your specific system and project.
Challenge 4: No GPU Means No GPU Assumptions
Many LLM tools assume GPU availability. Examples:
- Default configurations often set GGML_CUDA=ON
- Documentation emphasizes GPU performance
- Benchmarking tools test GPU backends first
This isn’t a showstopper - it just means reading documentation carefully and understanding which settings are hardware-dependent.
The flip side: CPU-only inference is more portable. No driver version mismatches, no CUDA toolkit version conflicts, no GPU memory management headaches. Just CPU and RAM.
Reproducible Setup Guide: Your Turn
Want to replicate this on your own RISC-V system? Here’s the condensed, step-by-step guide.
Prerequisites
Hardware requirements:
- RISC-V processor with vector extensions (rv64imafdcv or similar)
- At least 64GB RAM for 13B models, 125GB+ for 70B models
- 100GB+ free storage for models
- Multiple cores recommended (parallelism helps CPU inference)
Verify hardware capabilities:
# Check for RISC-V vector extensions (must contain 'v')
cat /proc/cpuinfo | grep isa | head -1
# Expected output: isa: rv64imafdcv (or similar with 'v')
# Check available RAM
free -h
# Need 64GB+ for 13B models, 125GB+ for 70B models
# Check available storage
df -h /
# Need 100GB+ free for model storage
Software requirements:
# On Fedora RISC-V - Install dependencies
sudo dnf install -y \
gcc gcc-c++ cmake git ccache \
golang make \
libstdc++-devel \
python3 python3-pip
# Verify compiler versions
gcc --version | grep "gcc"
# Expected: gcc (GCC) 13.x or newer
cmake --version | grep "version"
# Expected: cmake version 3.20 or newer
go version
# Expected: go version go1.22 or newer
💡 Tip: If any version check fails, you may need to install a newer version from source or enable additional repositories.
Building llama.cpp
# Clone repository
cd ~/
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Configure for CPU with vector extensions
cmake -B build \
-DGGML_CPU=ON \
-DGGML_CPU_AARCH64=ON \
-DGGML_ACCELERATE=ON \
-DCMAKE_BUILD_TYPE=Release
# Build with all cores
cmake --build build --config Release -j $(nproc)
# Test
./build/bin/llama-cli --version
Expected output:
llama-cli version [commit-hash]
Troubleshooting Common Build Errors
If the llama.cpp build fails, here are the most common issues and solutions:
Error: illegal instruction or SIGILL when running llama-cli:
./build/bin/llama-cli --version
Illegal instruction (core dumped)
Cause: Your RISC-V processor lacks vector extensions (v) in its ISA.
Solution: Verify your ISA with cat /proc/cpuinfo | grep isa. If vector extensions are missing, you cannot run this build. Consider using a system with rv64imafdcv or similar ISA.
Error: cannot find -lstdc++ during linking:
/usr/bin/ld: cannot find -lstdc++: No such file or directory
Cause: Missing C++ standard library development headers.
Solution:
sudo dnf install libstdc++-devel
Build hangs or system becomes unresponsive:
Cause: Too much parallelism (too many -j jobs) for the available RAM. Each compilation job needs memory, and 64 parallel C++ compilations can exhaust RAM on machines with less memory than this one.
Solution: Reduce parallelism:
# Instead of -j $(nproc), use a lower value
cmake --build build --config Release -j 16
# Or for systems with limited RAM
cmake --build build --config Release -j 8
Error: ggml.h: No such file or directory:
Cause: Build directory not properly initialized or git submodules not updated.
Solution:
# Clean and reconfigure
rm -rf build
git submodule update --init --recursive
cmake -B build -DGGML_CPU=ON -DGGML_CPU_AARCH64=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
Build succeeds but executables are missing:
Cause: The build completed, but you're looking in the wrong location.
Solution: Executables are in build/bin/, not in the root directory:
ls -lh build/bin/llama-cli
./build/bin/llama-cli --version
Building Ollama
# Clone repository
cd ~/
git clone https://github.com/ollama/ollama.git
cd ollama
# Build with REDUCED parallelism for Go's memory needs
make -j 5
# Test
./ollama --version
⚠️ Warning: Do NOT use -j $(nproc) for Ollama builds on high-core-count systems. The Go compiler requires significantly more memory per parallel job than C/C++. Start with -j 5 and increase cautiously while monitoring memory usage.
Expected output:
ollama version is [version]
Deploying Models
# Start with a small model for validation
./ollama pull llama3.2
./ollama run llama3.2
>>> Hello, world!
# For larger models, ensure you have enough RAM and storage
./ollama pull deepseek-r1:70b # Downloads ~40GB
./ollama run deepseek-r1:70b
>>> Tell me about RISC-V architecture
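After pulling, it's worth checking what's on disk and how much memory a loaded model actually consumes. ollama list and ollama ps are standard subcommands, though the exact output format varies by version:
# List downloaded models and their on-disk size
./ollama list
# While a model is loaded, show its memory footprint and confirm it's running on CPU
./ollama ps
# Double-check overall memory headroom
free -h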
Troubleshooting
Build fails with “illegal instruction”:
Check if your RISC-V processor has vector extensions:
cat /proc/cpuinfo | grep isa
Look for the v extension in the ISA string.
Out of memory during Ollama build:
Reduce parallelism further:
make -j 2 # Or even -j 1 for very constrained systems
Model fails to load:
Check available memory:
free -h
For 70B models, you need 60GB+ free RAM.
Lessons Learned & Future Directions
After deploying LLMs on RISC-V, here’s what stands out:
What Worked Exceptionally Well
1. Architectural Portability
llama.cpp’s design philosophy paid off: write portable code, use cross-platform abstractions, let compilers handle architecture specifics. The result: RISC-V support without RISC-V-specific patches.
This is a lesson for all software projects. Portable code isn’t just about supporting more platforms - it’s about being ready when new platforms emerge.
2. Memory-First Hardware Strategy
Abundant RAM compensates for lack of GPU acceleration effectively. For non-production workloads (research, development, experimentation), CPU-only inference with generous memory is viable.
3. RISC-V Ecosystem Maturation
Core development tools work correctly. GCC, CMake, Go all function as expected. The fundamental toolchain is solid.
What Surprised Us
1. Go Compiler Memory Behavior
I didn’t expect language choice to affect optimal build parallelism so dramatically. This isn’t documented in most build guides because it rarely matters on typical systems.
2. Vector Extension Compatibility
The fact that ARM NEON code paths work on RISC-V vector extensions speaks to good architectural design on both sides. Cross-ISA optimization portability is underappreciated.
3. Infrastructure Tooling Gaps
Applications build easily, but infrastructure (Docker, Jenkins) requires more effort. The ecosystem is maturing unevenly.
What’s Next
Near-term improvements:
- Quantitative benchmarking: Run llama-bench to measure actual tokens/sec
- NUMA optimization: Investigate multi-socket memory topology
- Quantization comparison: Test Q4_0 vs Q8_0 vs FP16 performance/quality trade-offs (see the sketch after this list)
- Multi-model serving: Explore memory sharing between models
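A sketch of how that quantization comparison could run using the llama.cpp tools built earlier, assuming you start from an FP16 GGUF export of the model (filenames are placeholders):
# Produce Q8_0 and Q4_0 variants from an FP16 GGUF file
./build/bin/llama-quantize model-f16.gguf model-q8_0.gguf Q8_0
./build/bin/llama-quantize model-f16.gguf model-q4_0.gguf Q4_0
# Then compare size (ls -lh), speed (llama-bench), and answer quality across the variants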
Long-term possibilities:
- RISC-V GPU support: When RISC-V GPUs emerge, combine CPU + GPU inference
- Custom accelerators: RISC-V’s extensibility allows domain-specific instructions
- Production deployment: Scale from proof-of-concept to production API services
- Community benchmarking: Establish standard RISC-V LLM benchmarks
The Broader Picture
This deployment proves that RISC-V is ready for serious AI workloads today. Not “someday when the ecosystem matures,” but right now. You can build state-of-the-art LLM software, deploy multi-billion parameter models, and run meaningful inference workloads.
The trade-offs are real - slower inference, ecosystem gaps, infrastructure complexity - but they’re manageable trade-offs, not fundamental blockers.
As AI becomes increasingly central to computing, the open-source nature of RISC-V becomes increasingly valuable. No licensing fees, no vendor lock-in, complete transparency in processor design. That matters for democratizing AI.
Conclusion: RISC-V Is Ready
When I started this project, I wondered if running a 70B LLM on RISC-V was merely possible or actually practical. The answer: both.
It’s possible because the software ecosystem (llama.cpp, Ollama, GGML) is well-designed and portable. Because RISC-V vector extensions provide necessary SIMD capabilities. Because modern compilers abstract architecture differences effectively.
It’s practical because build processes are straightforward, documentation exists, and tools work. Because you can replicate this setup in an afternoon. Because the performance, while slower than GPU, is adequate for real work.
The 64-core MilkV Pioneer with 125GB RAM isn’t the only way to run LLMs on RISC-V - it’s just the way I chose to prove it’s viable. Smaller systems can run smaller models. Future systems with GPU acceleration will be faster. But today, with CPU-only hardware, you can deploy and use state-of-the-art language models on open-source processor architecture.
If you’re working with RISC-V, experimenting with LLMs, or interested in AI on alternative architectures, this is your roadmap. The hardware exists, the software works, and the ecosystem is improving rapidly.
Now go build something.
Further Resources
Hardware:
- MilkV Pioneer: https://milkv.io/pioneer
- RISC-V International: https://riscv.org/
Software:
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Ollama: https://github.com/ollama/ollama
- GGML tensor library: https://github.com/ggerganov/ggml
Models:
- Ollama model library: https://ollama.com/library
- Hugging Face GGUF models: https://huggingface.co/models?library=gguf
Community:
- RISC-V Software Forum: https://groups.google.com/a/groups.riscv.org/g/sw-dev
- llama.cpp Discussions: https://github.com/ggerganov/llama.cpp/discussions
- Ollama Discord: https://discord.gg/ollama
Questions, feedback, or your own RISC-V LLM deployment stories? I’d love to hear about them. The RISC-V AI ecosystem grows through shared knowledge and experimentation.