Running a 70B LLM on Pure RISC-V: The MilkV Pioneer Deployment Journey
When the 40GB download completed and the model loaded into memory, I wondered: would a 70-billion parameter language model actually run on a RISC-V CPU? No GPU. No CUDA. Just 64 cores and a lot of RAM. Spoiler: it worked. Here’s the complete technical journey of deploying llama.cpp and Ollama on the MilkV Pioneer, including build challenges, architectural insights, and practical lessons for running state-of-the-art LLMs on CPU-only RISC-V hardware.
Why This Matters: LLMs Meet RISC-V
RISC-V is the underdog processor architecture. While x86 and ARM dominate the computing landscape with decades of software optimization and hardware acceleration, RISC-V is the open-source upstart. It’s the architecture you’ve probably never heard of unless you follow hardware closely. And running modern AI workloads on RISC-V? That’s even more niche.
But here’s the thing: RISC-V represents something important. It’s an open ISA (Instruction Set Architecture) that anyone can implement without licensing fees. It’s the Linux of processor architectures. And as AI becomes increasingly central to computing, the question isn’t just “Can we run LLMs on RISC-V?” but rather “How do we democratize AI hardware?”
I got my hands on a MilkV Pioneer - a 64-core RISC-V workstation - and decided to find out if it could run serious language models. Not 7B parameter toy models, but the big ones. The 70-billion parameter models that typically require expensive NVIDIA GPUs.
The equation seemed impossible at first: a 70B parameter model in FP16 precision requires roughly 140GB of memory. Our system has 125GB of RAM. Yet with quantization and smart memory management, it worked. This article documents the entire journey: the hardware choices, build challenges, surprising insights, and step-by-step instructions for anyone wanting to replicate this.
⚠️ Important: Scope and Limitations: This article documents deployment feasibility and build processes for running LLMs on RISC-V hardware. Quantitative performance benchmarks (tokens per second, latency, throughput) were not collected during this deployment. Performance estimates mentioned in this article (such as “1-5 tokens/sec”) are theoretical extrapolations based on hardware characteristics, not measured data.
Readers planning production deployments should conduct their own benchmarking with llama-bench or similar tools to measure actual performance on their specific hardware and workloads.
Hardware Foundation: The MilkV Pioneer
Before diving into software, let’s talk about the hardware that made this possible. The MilkV Pioneer isn’t your typical development board. It’s a serious workstation-class machine designed around RISC-V processors.
System Specifications
Here’s what we’re working with:
| Component | Specification |
|---|---|
| CPU | 64-core RISC-V processor (vendor ID: 0x5b7) |
| ISA | rv64imafdcv |
| Memory | 125GB RAM + 8GB swap (122GB available after OS overhead) |
| Storage | 939GB NVMe SSD (119GB used, 820GB available) |
| OS | Fedora Linux 38 Workstation (riscv64) |
| GPU | None - pure CPU inference only |
Let me unpack that ISA specification because it’s crucial to understanding why this deployment worked.
Decoding rv64imafdcv: The Secret Sauce
The ISA string rv64imafdcv is more than just alphabet soup. Each letter represents a set of processor capabilities:
| Extension | Full Name | Significance for LLM Inference |
|---|---|---|
| rv64 | 64-bit base integer ISA | Enables addressing the full 125GB memory space for large models |
| i | Integer operations | Base integer arithmetic operations |
| m | Multiply/Divide | Critical for matrix operations in transformer architectures |
| a | Atomic instructions | Thread synchronization for parallel inference across 64 cores |
| f | Single-precision floating-point | FP32 operations for model weights and computations |
| d | Double-precision floating-point | FP64 for high-precision accumulation where needed |
| c | Compressed instructions | 16-bit instruction encoding (reduces code size) |
| v | Vector extensions | CRITICAL: SIMD operations for tensor mathematics |
The v extension is the game-changer here. Vector extensions provide SIMD (Single Instruction, Multiple Data) capabilities similar to x86 AVX or ARM NEON. Without these vector extensions, CPU-only inference would be 5-10x slower. With them, we can efficiently parallelize the matrix multiplications that dominate transformer model inference.
This is why llama.cpp’s GGML_CPU_AARCH64=ON configuration works on RISC-V: the vector extensions are architecturally similar enough to ARM NEON that the same optimized code paths function correctly. Cross-architecture code reuse for the win.
The Strategic Hardware Choice: Memory First
You might notice something unusual about this system: 125GB of RAM is massive for a workstation. Why so much?
The answer reveals our core strategy: strategic compensation. When you don’t have GPU acceleration, you compensate with other resources:
- No GPU memory? → Use abundant system RAM instead
- Slower per-core inference? → Throw 64 cores at the problem
- Limited software ecosystem? → Choose portable, well-designed software
This memory-first approach enables loading entire large models into RAM without swapping to disk. A 70B parameter model quantized to Q4_0 format requires roughly 40GB of memory. With 125GB available, we have comfortable headroom for:
- The model itself (~40GB)
- Operating system and services (~3GB)
- Inference state and KV cache (~10-20GB)
- Multiple concurrent sessions if needed
The result: no disk swapping, no memory pressure, and the ability to load models that would choke smaller systems.
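If you want to sanity-check those numbers, a back-of-envelope calculation gets close enough. The bits-per-weight figures below are approximations (Q4_0 works out to roughly 4.5 bits per weight once block scales are included, Q8_0 to roughly 8.5), so treat the output as ballpark sizing rather than exact file sizes:
# Rough size estimates for a 70B-parameter model at different precisions
awk 'BEGIN {
  p = 70e9                                        # parameter count
  printf "FP16 : %.0f GB\n", p * 16  / 8 / 1e9
  printf "Q8_0 : %.0f GB\n", p * 8.5 / 8 / 1e9    # ~8.5 bits/weight incl. block scales
  printf "Q4_0 : %.0f GB\n", p * 4.5 / 8 / 1e9    # ~4.5 bits/weight incl. block scales
}'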
Journey Part 1: Building llama.cpp
With hardware in place, the real challenge began: getting cutting-edge LLM software to build on an architecture it was never explicitly designed for.
Toolchain Exploration: Finding Solid Ground
My first instinct was to use the latest compiler. GCC 14.2.0 source build? Sure, bleeding-edge features sound great. I also explored the XUANTIE RISC-V GNU Toolchain from T-Head (Alibaba’s RISC-V division), thinking vendor-specific optimizations might help.
But sometimes newer isn’t better, especially on emerging architectures. After experimentation, I settled on the system-provided GCC 13.2.1 from Red Hat. Why?
- Proven stability: Red Hat’s testing means fewer surprises
- Ecosystem integration: Works seamlessly with Fedora’s packages
- ccache support: Dramatically speeds incremental builds
- Good enough: Compiler optimization differences rarely matter as much as architecture-specific code paths
The lesson: on cutting-edge hardware, conservative toolchain choices often win.
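If you follow the same route, it's worth confirming that ccache is actually in play before kicking off large builds. These are standard ccache commands; the wrapper directory shown is where Fedora installs it and may differ on other distributions:
# Check whether the compiler being picked up is the ccache wrapper (Fedora: /usr/lib64/ccache)
which cc gcc
# Zero the cache statistics before the first build, then check the hit rate after a rebuild
ccache -z
ccache -s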
The Build: Surprisingly Straightforward
llama.cpp is designed for portability, and it shows. The build process was refreshingly simple:
# Clone the repository
cd ~/
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Configure with CMake
cmake -B build
# Build with all 64 cores
cmake --build build --config Release -j 64
That’s it. No special patches, no RISC-V-specific forks, no fighting with dependencies. Just standard CMake commands.
The build took advantage of all 64 cores (hence -j 64) and ccache reduced rebuild times significantly. Within minutes, we had 30+ executables in build/bin/:
$ ls -lh build/bin/ | grep llama
-rwxr-xr-x. 1 user user 1.4M Jan 28 10:23 llama-cli
-rwxr-xr-x. 1 user user 1.5M Jan 28 10:23 llama-server
-rwxr-xr-x. 1 user user 1007K Jan 28 10:23 llama-bench
-rwxr-xr-x. 1 user user 1.2M Jan 28 10:23 llama-quantize
-rwxr-xr-x. 1 user user 1.4M Jan 28 10:23 libllama.so
Configuration Deep-Dive: What Makes It Work
The real magic happens in the CMake configuration. Let’s look at the key settings from build/CMakeCache.txt:
# CPU backend configuration
GGML_CPU:BOOL=ON # CPU backend enabled
GGML_CPU_AARCH64:BOOL=ON # ← The critical setting
GGML_ACCELERATE:BOOL=ON # Generic CPU acceleration
GGML_CCACHE:BOOL=ON # Build caching for speed
# Explicitly disabled (hardware constraints)
GGML_CUDA:BOOL=OFF # No NVIDIA GPU
GGML_AVX:BOOL=OFF # x86-specific, not available
GGML_AVX2:BOOL=OFF # x86-specific, not available
GGML_AVX512:BOOL=OFF # x86-specific, not available
# Not used (simplicity)
GGML_BLAS:BOOL=OFF # No external BLAS library
# Build configuration
CMAKE_BUILD_TYPE:STRING=Release # Optimize for performance
CMAKE_C_COMPILER=/usr/lib64/ccache/cc # GCC 13.2.1 with ccache
The GGML_CPU_AARCH64=ON setting deserves special attention. This tells llama.cpp to use its ARM-optimized code paths on RISC-V. Why does this work?
RISC-V vector extensions (the v in rv64imafdcv) provide SIMD capabilities architecturally similar to ARM NEON. The compiler can map ARM intrinsics to RISC-V vector instructions, allowing code written for aarch64 to function correctly on RISC-V. This is architectural portability at its finest - well-designed abstractions enable cross-platform optimization.
Another interesting choice: GGML_BLAS=OFF. We’re not using an external BLAS (Basic Linear Algebra Subprograms) library. Why not?
Possible reasons:
- No optimized BLAS library readily available for RISC-V
- llama.cpp’s built-in GGML operations sufficient for performance
- Reducing dependencies simplifies the build and deployment
- RISC-V numerical library ecosystem not yet mature
I call this “strategic simplicity” - minimizing moving parts to ensure stability. When blazing new trails on emerging architectures, fewer dependencies mean fewer things that can break.
⚠️ Warning: The GGML_CPU_AARCH64=ON configuration works on RISC-V processors with vector extensions. If your RISC-V system lacks the v extension, you'll need different settings and should expect significantly slower performance.
Validation: Does It Actually Work?
Build success doesn’t mean functionality. Let’s test:
$ ./build/bin/llama-cli --version
llama-cli version b8595b16
$ ./build/bin/llama-bench --help
usage: llama-bench [options]
...
It runs. The binaries execute without errors, help text displays correctly, and we can move forward. Small victories matter when working with new architectures.
Journey Part 2: Deploying Ollama
llama.cpp proved that RISC-V could run LLM inference engines. But llama.cpp is developer-focused - you manage GGUF files manually, pass command-line arguments for every option, and handle model lifecycle yourself.
For actual usability, I wanted Ollama. Ollama wraps llama.cpp in a friendly interface with automatic model downloading, versioning, and an OpenAI-compatible HTTP API. It’s the difference between git and GitHub Desktop - same power, much better UX.
The First Build Attempt: A Lesson in Memory
Ollama is written in Go and wraps llama.cpp for the actual inference. My first build attempt used the same parallelism strategy that worked for llama.cpp:
cd ~/
git clone https://github.com/ollama/ollama.git
cd ollama
# Try building with all cores
make -j 64
This failed. Not with compiler errors, but with the system grinding to a halt. Memory usage spiked, swap thrashed, and the build eventually died.
What went wrong?
The Go compiler has fundamentally different memory characteristics than C/C++ compilers:
| Aspect | C/C++ (GCC) | Go |
|---|---|---|
| Compilation model | Per-file, independent | Whole-program optimization |
| Memory per job | ~500MB-1GB | ~2-4GB |
| 64 parallel jobs | ~32-64GB (fits in 125GB) | ~128-256GB (exceeds capacity) |
The math was simple: Go’s compiler maintains large in-memory representations during compilation. With 64 parallel jobs, we’d need 128-256GB of memory. Our system has 125GB. Recipe for disaster.
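The easy way to catch this before the system starts thrashing is to watch memory from a second terminal while the build runs, or to record a build's peak memory usage. Both commands below are standard Linux tooling (the second needs GNU time from the time package, not the shell builtin):
# In another terminal: watch the 'available' column and back off -j if it keeps shrinking
watch -n 5 free -h
# Record peak memory of a build attempt; look for "Maximum resident set size" in the report
/usr/bin/time -v make -j 5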
The Solution: Language-Specific Build Strategies
The fix was equally simple:
# Reduce parallelism for Go's memory requirements
make -j 5
With only 5 parallel jobs (~10-20GB memory), the build completed successfully:
$ ls -lh ollama
-rwxr-xr-x. 1 user user 27M Jan 28 11:45 ollama
$ ./ollama --version
ollama version is 0.5.7-6-g2ef3c80
Key lesson: On high-core-count systems, build parallelism isn’t just $(nproc). It depends on the language implementation:
- C/C++ projects: -j $(nproc) usually works
- Go projects: -j $(( $(nproc) / 10 )) or less to limit memory
- Rust projects: Somewhere in between (depends on project size)
This is especially relevant for RISC-V systems, which often have higher core counts to compensate for lower single-threaded performance.
💡 Tip: If you’re building Go projects on high-core-count machines, start with low parallelism and increase gradually while monitoring memory usage. The fastest build is the one that completes without thrashing.
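One way to pick a starting point is to size the job count from available memory rather than core count. The 4GB-per-job figure below is just the rough estimate from the table above, not a measured constant, so adjust it for your project:
# Derive Go build parallelism from available RAM instead of core count (rough heuristic)
AVAIL_GB=$(free -g | awk '/^Mem:/ {print $7}')   # 'available' column, in GB
JOBS=$(( AVAIL_GB / 4 ))                         # assume ~4GB per parallel Go compile job
[ "$JOBS" -lt 1 ] && JOBS=1
echo "Building with -j $JOBS"
make -j "$JOBS"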
Architecture Comparison: llama.cpp vs Ollama
Now that we have both systems built, let’s compare their trade-offs:
| Dimension | llama.cpp | Ollama |
|---|---|---|
| Implementation | C/C++ (performance-first) | Go (productivity-first) wrapping llama.cpp |
| Distribution | 30+ specialized binaries | Single unified binary (27MB) |
| Model management | Manual GGUF file handling | Automatic download and versioning |
| API | Low-level C API + CLI tools | HTTP API (OpenAI-compatible) |
| Build complexity | CMake, multiple targets | Makefile, single target (but higher memory) |
| Best for | Custom integration, embedded systems, maximum control | Rapid prototyping, user-friendly deployment, API services |
For this deployment, I wanted both: llama.cpp for understanding the low-level mechanics, and Ollama for actually using the models day-to-day.
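To make the Ollama side of that table concrete: once the server is running, everything goes through the HTTP API on port 11434. The prompt below is just an example:
# Start the Ollama server (listens on 127.0.0.1:11434 by default)
./ollama serve &
# Query the native HTTP API; "stream": false returns a single JSON object instead of a token stream
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain the RISC-V v extension in one sentence.",
  "stream": false
}'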
The 70B Milestone: Pushing the Limits
With both llama.cpp and Ollama successfully built, it was time to deploy actual models. I started small to validate the setup, then went big to prove the concept.
Warmup: llama3.2 Deployment
First, a sanity check with a smaller model:
$ ./ollama pull llama3.2
pulling manifest
pulling dde5aa3fc5ffc... 100% ▕████████████████▏ 1.9 GB
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕████████████████▏ 96 B
pulling 34bb5ab01051... 100% ▕████████████████▏ 561 B
success
$ ./ollama run llama3.2
>>> Hello! How are you?
Hello! I'm doing well, thank you for asking. I'm a large language model,
so I don't have feelings like humans do, but I'm functioning properly and
ready to help with any questions or tasks you might have. How can I assist
you today?
Success! A 2GB model downloads, loads, and responds correctly. The system works.
The Big One: deepseek-r1:70b
Now for the real test. DeepSeek-R1 is a 70-billion parameter model - one of the largest open-source language models available. The download alone is 40GB:
$ ./ollama pull deepseek-r1:70b
pulling manifest
pulling 4cd576ddf4bc... 100% ▕████████████████▏ 40 GB # The model itself
pulling 31ec11a09e5c... 100% ▕████████████████▏ 1.3 KB # Template
pulling 2be0dde01a1e... 100% ▕████████████████▏ 11 KB # License
pulling 2434009efafa... 100% ▕████████████████▏ 1.6 KB # Parameters
pulling 0f3aa73e5463... 100% ▕████████████████▏ 491 B # Metadata
success
The download completed. Total size: 42.5GB, comfortably within our 820GB available storage. But could it actually run?
$ ./ollama run deepseek-r1:70b
>>> Explain RISC-V vector extensions in technical terms.
<think>
The user wants a technical explanation of RISC-V vector extensions. I should
cover the key aspects: SIMD capabilities, variable-length vectors, register
organization, and how they compare to other architectures...
</think>
RISC-V vector extensions (RVV) provide scalable SIMD processing capabilities
through a variable-length vector architecture. Unlike fixed-width SIMD in x86
(AVX-512) or ARM (NEON), RVV uses the VLEN (vector length) parameter which can
be implementation-defined from 128 bits to effectively unlimited.
The extension introduces 32 vector registers (v0-v31) with configurable
effective length through the vtype CSR (control and status register). This
allows the same code to run efficiently on different hardware implementations...
[continues with detailed technical explanation]
It works.
A 70-billion parameter language model is running on a RISC-V CPU. No GPU. No CUDA. No specialized AI accelerators. Just 64 cores, 125GB of RAM, and well-designed software.
Why 70B on CPU-Only ACTUALLY Works
Let’s break down the technical reality of running a 70B model on CPU:
Memory Requirements:
- 70B parameters in FP16: ~140GB (too large)
- 70B parameters in Q4_0 quantization: ~40GB (fits comfortably)
- KV cache for inference context: ~10-20GB
- OS and services overhead: ~3GB
- Total: ~53-63GB out of 125GB available
Performance Trade-offs:
- GPU inference (A100): 70B @ 20-40 tokens/sec (estimated)
- This system: 70B @ 1-5 tokens/sec (estimated based on hardware)
- Latency ratio: 5-10x slower than high-end GPU
- Cost ratio: $0 GPU cost vs $10,000+ for A100
The Strategic Win:
For research, development, experimentation, and batch processing, slower inference is acceptable. We’re not serving production API requests at scale - we’re exploring what’s possible, testing prompts, and validating that RISC-V can handle serious AI workloads.
The 125GB RAM strategy eliminates the GPU memory bottleneck entirely. There’s no swapping, no OOM kills, no careful memory management. The entire model fits in RAM with room to spare.
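You can verify the no-swapping claim on your own system while a model is generating: if the si and so columns in vmstat stay at zero, the model really is being served entirely from RAM:
# Watch swap-in (si) and swap-out (so) once per second during inference
vmstat 1
# Cross-check overall memory and swap usage
free -h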
⚠️ Important: These performance estimates are theoretical, derived from hardware characteristics rather than measurements. Actual benchmarking with llama-bench would provide quantitative validation, which is planned as future work.
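When you do benchmark, an invocation along these lines is a reasonable starting point; the model path is illustrative, so point it at whatever GGUF file you have on disk:
# Example llama-bench run: -t threads, -p prompt tokens to test, -n tokens to generate
./build/bin/llama-bench -m /path/to/model-q4_0.gguf -t 64 -p 512 -n 128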
Challenges Deep-Dive: Lessons from the Trenches
Success stories gloss over the messy parts. Let’s talk about what made this challenging and what I learned.
Challenge 1: Ecosystem Maturity Gaps
RISC-V isn’t x86. The software ecosystem is younger, and it shows in unexpected places.
What’s missing:
- Pre-built packages: Many tools require building from source
- Optimized libraries: No mature BLAS/LAPACK implementations
- Container infrastructure: Docker/containerd need source builds
- CI/CD tooling: Jenkins required manual Java installation
I ended up building Docker’s entire stack from source: runc, crun, containerd, dockerd, plus dependencies like libseccomp. This took hours and involved wrangling multiple Go projects.
Similarly, Jenkins needed Temurin JDK 21 manually installed to /opt with profile scripts:
sudo tar -xzf temurin.tgz -C /opt
sudo sh -c 'echo "export JAVA_HOME=/opt/jdk-21.0.5+11" >> /etc/profile.d/jdk21.sh'
sudo sh -c 'echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile.d/jdk21.sh'
The bright side:
Not everything requires source builds. Core development tools (GCC, Git, CMake, Go) are packaged in Fedora. The ecosystem is improving rapidly.
Lesson: RISC-V is transitioning from “pioneering” to “early mainstream.” Application development works well, but infrastructure tooling lags. Budget extra time for DevOps setup.
Challenge 2: Memory-Centric Performance Thinking
Without GPU acceleration, performance optimization requires different mental models.
Traditional GPU-based LLM deployment:
- Load model into GPU VRAM (limited, expensive)
- Transfer input over PCIe to GPU
- Compute on GPU (fast)
- Transfer output over PCIe to CPU
- Bottleneck: PCIe bandwidth and VRAM capacity
CPU-only RISC-V deployment:
- Load model into system RAM (abundant, cheap)
- Compute on CPU cores (slower per-operation)
- Data stays in RAM (no transfers)
- Bottleneck: CPU compute throughput and memory bandwidth
This shifts optimization strategies:
- GPU optimization: Minimize data transfer, maximize VRAM usage, batch aggressively
- CPU optimization: Maximize core utilization, optimize memory access patterns, leverage caching
Opportunities I haven’t fully explored:
- NUMA awareness: A 64-core system is likely multi-socket. NUMA-aware memory allocation could improve performance (see the sketch after this list)
- Memory-mapped models: Lazy-loading model weights from disk instead of full load
- Multi-model sharing: Load model once, serve multiple concurrent sessions
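For the NUMA item, a concrete first experiment (untested on this machine, and only meaningful if the board actually exposes multiple NUMA nodes) would look like this:
# Inspect the NUMA topology first
numactl --hardware
# If there are multiple nodes, try interleaving the server's memory across them
numactl --interleave=all ./ollama serve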
Challenge 3: Build Parallelism Is Language-Dependent
I mentioned this earlier, but it’s worth emphasizing: the Go compiler memory issue caught me off-guard.
On x86 systems with 8-16 cores, you rarely hit Go’s memory limits. But on a 64-core RISC-V system, suddenly compiler memory characteristics matter enormously.
Rule of thumb for high-core systems:
# C/C++ projects (low memory per job)
make -j $(nproc)
# Rust projects (medium memory per job)
make -j $(( $(nproc) / 2 ))
# Go projects (high memory per job)
make -j $(( $(nproc) / 10 ))
Monitor with htop during builds to tune parallelism for your specific system and project.
Challenge 4: No GPU Means No GPU Assumptions
Many LLM tools assume GPU availability. Examples:
- Default configurations often set GGML_CUDA=ON
- Documentation emphasizes GPU performance
- Benchmarking tools test GPU backends first
This isn’t a showstopper - it just means reading documentation carefully and understanding which settings are hardware-dependent.
The flip side: CPU-only inference is more portable. No driver version mismatches, no CUDA toolkit version conflicts, no GPU memory management headaches. Just CPU and RAM.
Reproducible Setup Guide: Your Turn
Want to replicate this on your own RISC-V system? Here’s the condensed, step-by-step guide.
Prerequisites
Hardware requirements:
- RISC-V processor with vector extensions (rv64imafdcv or similar)
- At least 64GB RAM for 13B models, 125GB+ for 70B models
- 100GB+ free storage for models
- Multiple cores recommended (parallelism helps CPU inference)
Verify hardware capabilities:
# Check for RISC-V vector extensions (must contain 'v')
cat /proc/cpuinfo | grep isa | head -1
# Expected output: isa: rv64imafdcv (or similar with 'v')
# Check available RAM
free -h
# Need 64GB+ for 13B models, 125GB+ for 70B models
# Check available storage
df -h /
# Need 100GB+ free for model storage
Software requirements:
# On Fedora RISC-V - Install dependencies
sudo dnf install -y \
gcc gcc-c++ cmake git ccache \
golang make \
libstdc++-devel \
python3 python3-pip
# Verify compiler versions
gcc --version | grep "gcc"
# Expected: gcc (GCC) 13.x or newer
cmake --version | grep "version"
# Expected: cmake version 3.20 or newer
go version
# Expected: go version go1.22 or newer
💡 Tip: If any version check fails, you may need to install a newer version from source or enable additional repositories.
Building llama.cpp
# Clone repository
cd ~/
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Configure for CPU with vector extensions
cmake -B build \
-DGGML_CPU=ON \
-DGGML_CPU_AARCH64=ON \
-DGGML_ACCELERATE=ON \
-DCMAKE_BUILD_TYPE=Release
# Build with all cores
cmake --build build --config Release -j $(nproc)
# Test
./build/bin/llama-cli --version
Expected output:
llama-cli version [commit-hash]
Troubleshooting Common Build Errors
If the llama.cpp build fails, here are the most common issues and solutions:
Error: illegal instruction or SIGILL when running llama-cli:
./build/bin/llama-cli --version
Illegal instruction (core dumped)
Cause: Your RISC-V processor lacks vector extensions (v) in its ISA.
Solution: Verify your ISA with cat /proc/cpuinfo | grep isa. If vector extensions are missing, you cannot run this build. Consider using a system with rv64imafdcv or similar ISA.
Error: cannot find -lstdc++ during linking:
/usr/bin/ld: cannot find -lstdc++: No such file or directory
Cause: Missing C++ standard library development headers.
Solution:
sudo dnf install libstdc++-devel
Build hangs or system becomes unresponsive:
Cause: Too much parallelism (too many -j jobs) for the available RAM. Each compilation job needs memory, and 64 parallel C++ compilations can exhaust RAM on machines with less memory than this one.
Solution: Reduce parallelism:
# Instead of -j $(nproc), use a lower value
cmake --build build --config Release -j 16
# Or for systems with limited RAM
cmake --build build --config Release -j 8
Error: ggml.h: No such file or directory:
Cause: Build directory not properly initialized or git submodules not updated.
Solution:
# Clean and reconfigure
rm -rf build
git submodule update --init --recursive
cmake -B build -DGGML_CPU=ON -DGGML_CPU_AARCH64=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
Build succeeds but executables are missing:
Cause: The build completed, but you're looking in the wrong location.
Solution: Executables are in build/bin/, not in the root directory:
ls -lh build/bin/llama-cli
./build/bin/llama-cli --version
Building Ollama
# Clone repository
cd ~/
git clone https://github.com/ollama/ollama.git
cd ollama
# Build with REDUCED parallelism for Go's memory needs
make -j 5
# Test
./ollama --version
⚠️ Warning: Do NOT use -j $(nproc) for Ollama builds on high-core-count systems. The Go compiler requires significantly more memory per parallel job than C/C++. Start with -j 5 and increase cautiously while monitoring memory usage.
Expected output:
ollama version is [version]
Deploying Models
# Start with a small model for validation
./ollama pull llama3.2
./ollama run llama3.2
>>> Hello, world!
# For larger models, ensure you have enough RAM and storage
./ollama pull deepseek-r1:70b # Downloads ~40GB
./ollama run deepseek-r1:70b
>>> Tell me about RISC-V architecture
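After pulling, it's worth checking what's on disk and how much memory a loaded model actually consumes. ollama list and ollama ps are standard subcommands, though the exact output format varies by version:
# List downloaded models and their on-disk size
./ollama list
# While a model is loaded, show its memory footprint and confirm it's running on CPU
./ollama ps
# Double-check overall memory headroom
free -h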
Troubleshooting
Build fails with “illegal instruction”:
Check if your RISC-V processor has vector extensions:
cat /proc/cpuinfo | grep isa
Look for the v extension in the ISA string.
Out of memory during Ollama build:
Reduce parallelism further:
make -j 2 # Or even -j 1 for very constrained systems
Model fails to load:
Check available memory:
free -h
For 70B models, you need 60GB+ free RAM.
Lessons Learned & Future Directions
After deploying LLMs on RISC-V, here’s what stands out:
What Worked Exceptionally Well
1. Architectural Portability
llama.cpp’s design philosophy paid off: write portable code, use cross-platform abstractions, let compilers handle architecture specifics. The result: RISC-V support without RISC-V-specific patches.
This is a lesson for all software projects. Portable code isn’t just about supporting more platforms - it’s about being ready when new platforms emerge.
2. Memory-First Hardware Strategy
Abundant RAM compensates for lack of GPU acceleration effectively. For non-production workloads (research, development, experimentation), CPU-only inference with generous memory is viable.
3. RISC-V Ecosystem Maturation
Core development tools work correctly. GCC, CMake, Go all function as expected. The fundamental toolchain is solid.
What Surprised Us
1. Go Compiler Memory Behavior
I didn’t expect language choice to affect optimal build parallelism so dramatically. This isn’t documented in most build guides because it rarely matters on typical systems.
2. Vector Extension Compatibility
The fact that ARM NEON code paths work on RISC-V vector extensions speaks to good architectural design on both sides. Cross-ISA optimization portability is underappreciated.
3. Infrastructure Tooling Gaps
Applications build easily, but infrastructure (Docker, Jenkins) requires more effort. The ecosystem is maturing unevenly.
What’s Next
Near-term improvements:
- Quantitative benchmarking: Run llama-bench to measure actual tokens/sec
- NUMA optimization: Investigate multi-socket memory topology
- Quantization comparison: Test Q4_0 vs Q8_0 vs FP16 performance/quality trade-offs (see the sketch after this list)
- Multi-model serving: Explore memory sharing between models
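A sketch of how that quantization comparison could run using the llama.cpp tools built earlier, assuming you start from an FP16 GGUF export of the model (filenames are placeholders):
# Produce Q8_0 and Q4_0 variants from an FP16 GGUF file
./build/bin/llama-quantize model-f16.gguf model-q8_0.gguf Q8_0
./build/bin/llama-quantize model-f16.gguf model-q4_0.gguf Q4_0
# Then compare size (ls -lh), speed (llama-bench), and answer quality across the variants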
Long-term possibilities:
- RISC-V GPU support: When RISC-V GPUs emerge, combine CPU + GPU inference
- Custom accelerators: RISC-V’s extensibility allows domain-specific instructions
- Production deployment: Scale from proof-of-concept to production API services
- Community benchmarking: Establish standard RISC-V LLM benchmarks
The Broader Picture
This deployment proves that RISC-V is ready for serious AI workloads today. Not “someday when the ecosystem matures,” but right now. You can build state-of-the-art LLM software, deploy multi-billion parameter models, and run meaningful inference workloads.
The trade-offs are real - slower inference, ecosystem gaps, infrastructure complexity - but they’re manageable trade-offs, not fundamental blockers.
As AI becomes increasingly central to computing, the open-source nature of RISC-V becomes increasingly valuable. No licensing fees, no vendor lock-in, complete transparency in processor design. That matters for democratizing AI.
Conclusion: RISC-V Is Ready
When I started this project, I wondered if running a 70B LLM on RISC-V was merely possible or actually practical. The answer: both.
It’s possible because the software ecosystem (llama.cpp, Ollama, GGML) is well-designed and portable. Because RISC-V vector extensions provide necessary SIMD capabilities. Because modern compilers abstract architecture differences effectively.
It’s practical because build processes are straightforward, documentation exists, and tools work. Because you can replicate this setup in an afternoon. Because the performance, while slower than GPU, is adequate for real work.
The 64-core MilkV Pioneer with 125GB RAM isn’t the only way to run LLMs on RISC-V - it’s just the way I chose to prove it’s viable. Smaller systems can run smaller models. Future systems with GPU acceleration will be faster. But today, with CPU-only hardware, you can deploy and use state-of-the-art language models on open-source processor architecture.
If you’re working with RISC-V, experimenting with LLMs, or interested in AI on alternative architectures, this is your roadmap. The hardware exists, the software works, and the ecosystem is improving rapidly.
Now go build something.
Further Resources
Hardware:
- MilkV Pioneer: https://milkv.io/pioneer
- RISC-V International: https://riscv.org/
Software:
- llama.cpp: https://github.com/ggerganov/llama.cpp
- Ollama: https://github.com/ollama/ollama
- GGML tensor library: https://github.com/ggerganov/ggml
Models:
- Ollama model library: https://ollama.com/library
- Hugging Face GGUF models: https://huggingface.co/models?library=gguf
Community:
- RISC-V Software Forum: https://groups.google.com/a/groups.riscv.org/g/sw-dev
- llama.cpp Discussions: https://github.com/ggerganov/llama.cpp/discussions
- Ollama Discord: https://discord.gg/ollama
Questions, feedback, or your own RISC-V LLM deployment stories? I’d love to hear about them. The RISC-V AI ecosystem grows through shared knowledge and experimentation.