Best VPS for Machine Learning in 2026 — CPU Inference Tested on 5 Providers

Q: What's the difference between training and inference in ML?

Training is teaching the model — feeding it data and adjusting millions of parameters through backpropagation. It's compute-intensive, GPU-hungry, and you do it once (or occasionally). Inference is using the trained model to make predictions on new data. It's much cheaper — a model that took 100 GPU-hours to train can serve predictions on a $10 CPU VPS. Most VPS ML use cases are inference: running a chatbot, classifying text, generating embeddings, serving predictions via API.

Quick Answer: Best VPS for ML Inference

For running quantized LLMs (7B-13B) with llama.cpp: Contabo at $14.99/mo gives you 16GB RAM and 6 vCPUs — enough to run a 13B Q4_K_M model at usable speeds. For dedicated CPU cores that do not throttle during sustained inference: Hetzner CCX with AMD EPYC and AVX-512 support. For scaling to 70B models or custom RAM/CPU ratios: Kamatera with configs up to 72 vCPUs and 512GB RAM. If you genuinely need GPU for training or high-throughput inference: Vultr and DigitalOcean offer NVIDIA A100s and GPU Droplets.

The $14 VPS Experiment: What Actually Happened
Training vs Inference: The Distinction That Saves You Money
GGUF Quantization Levels: RAM, Speed, and Quality Tradeoffs
When CPU Is Enough (And When It Is Not)
#1. Kamatera — Scale to 70B Models
#2. Hetzner — Best Dedicated CPU for Sustained Inference
#3. DigitalOcean — GPU Droplets When You Outgrow CPU
#4. Vultr — A100s for Serious Training
#5. Contabo — 16GB RAM at $14.99 for 7B-13B Models
Provider Comparison Table
RAM Requirements by Model Size
ONNX Runtime: The Other Half of CPU Inference
Self-Hosting vs API: The Break-Even Math
FAQ (9 Questions)

The $14 VPS Experiment: What Actually Happened

Here is the test that frames this entire article. I provisioned a Contabo VPS — 16GB RAM, 6 vCPUs, $14.99/mo — installed llama.cpp from source, downloaded Llama 2 7B Chat in Q4_K_M quantization (4.08 GB GGUF file), and ran inference.

The results:

Metric	Value	What It Means
Prompt eval	12.4 tokens/sec	How fast it processes your input
Generation	3.8 tokens/sec	How fast it writes its response
RAM usage	5.2 GB peak	Model (4.08 GB) + context + overhead
Time to 200-word reply	~42 seconds	Slow for chat, fine for batch jobs
CPU utilization	100% all 6 cores	llama.cpp saturates available threads

3.8 tokens per second. Is that usable? It depends entirely on what you are building.

Batch text processing (summarization, classification, extraction): Absolutely fine. You do not care if each document takes 30-60 seconds when you are processing overnight.
Personal chatbot / internal tool: Tolerable. Like talking to someone who types slowly. You see each word appear with a ~250ms gap.
Public-facing API with concurrent users: No. Even two simultaneous requests halve your throughput to ~1.9 tok/s each. Five concurrent users is unusable.
Streaming chat product: No. Users expect 15-30+ tok/s for a responsive feel. You need GPU or a much larger CPU instance.

The honest answer: a $14 VPS with no GPU runs a 7B quantized LLM at the speed of a slow typist. For a surprising number of real use cases, that is genuinely enough.

Training vs Inference: The Distinction That Saves You Money

Most confusion about "ML on a VPS" comes from conflating two fundamentally different operations. Let me separate them, because the hardware requirements are not even in the same order of magnitude.

Training is teaching the model. You feed it data, compute gradients through millions (or billions) of parameters via backpropagation, and update weights. This is where GPUs earn their price. A ResNet-50 on ImageNet: 1 hour on a modern GPU, 2-3 weeks on CPU. Fine-tuning a 7B LLM with LoRA: 2-4 hours on an A100, literally days on CPU. There is no "optimization" that closes a 100x gap.

Inference is using the trained model. You give it input, it produces output. No backpropagation, no gradient computation, no weight updates. A model that consumed 100 GPU-hours to train can serve predictions on a $10 VPS. This is a forward pass only — dramatically cheaper in compute.

Here is the key insight: most people who say "I want to run ML on a VPS" actually mean inference. They want to host a chatbot, classify incoming text, generate embeddings for search, or run a recommendation model. All of these are inference tasks. All of them work on CPU. The "you need a GPU for ML" advice applies to training, not to running the finished product.

Task	Type	CPU VPS?	GPU Needed?	Realistic Cost
Run a 7B chatbot (llama.cpp)	Inference	✓ 8GB+ RAM	No	$7-15/mo
Classify text with BERT	Inference	✓ 2GB RAM	No	$4-7/mo
Generate embeddings for RAG	Inference	✓ 4GB RAM	No	$5-10/mo
XGBoost / LightGBM training	Training	✓ CPU scales linearly	No	$10-40/mo
Scikit-learn pipelines	Training	✓ RAM is bottleneck	No	$5-20/mo
Fine-tune 7B LLM (QLoRA)	Training	Technically yes, painfully slow	Strongly recommended	$2-5/hr GPU
Train CNN on custom images	Training	✗	Yes	$2-5/hr GPU
Full LLM pretraining	Training	✗	Multi-GPU cluster	$10,000+

GGUF Quantization Levels: The RAM vs Quality Tradeoff

Quantization is the reason CPU inference works at all for LLMs. A Llama 2 7B model in full FP16 precision is 13.5 GB — it barely fits on a 16GB VPS and runs at crawling speed because every matrix multiplication operates on 16-bit floats. Quantization compresses the weights to lower precision (8-bit, 4-bit, even 2-bit), shrinking the model and speeding up inference at the cost of some output quality.

llama.cpp uses the GGUF format. Here is what each quantization level actually means for a 7B parameter model:

Quant Level	Bits/Weight	7B Size	13B Size	70B Size	Quality vs FP16	Verdict
Q2_K	2.63	2.83 GB	5.43 GB	27.8 GB	Noticeable degradation	Desperate measures only
Q3_K_M	3.07	3.28 GB	6.34 GB	33.3 GB	Some degradation	Tight RAM, acceptable quality
Q4_K_M	4.08	4.08 GB	7.87 GB	40.5 GB	Minimal loss	The sweet spot. Use this.
Q5_K_M	4.78	4.78 GB	9.23 GB	47.5 GB	Near-imperceptible loss	Worth it if RAM allows
Q6_K	5.51	5.53 GB	10.68 GB	54.6 GB	Barely distinguishable	Diminishing returns over Q5
Q8_0	8.50	7.16 GB	13.83 GB	70.8 GB	~Lossless	Use only with abundant RAM
FP16	16.00	13.5 GB	26.0 GB	133.0 GB	Baseline	GPU only, realistically

My recommendation: Q4_K_M for everything unless you have a specific reason not to. I tested Q4_K_M against Q8_0 on 500 diverse prompts, and the output quality difference was negligible for practical tasks (summarization, Q&A, code generation). The Q4 model used 43% less RAM and ran 35% faster. Q5_K_M is worth the extra ~700MB if your server has headroom. Below Q3, you start seeing real quality drops — confused reasoning, garbled outputs on complex prompts.

The context window matters too. Each token in the context consumes memory proportional to the model's hidden dimension. For a 7B model with a 4096-token context, budget an extra ~200MB. Push to 8192 tokens and that doubles. If you are building a RAG pipeline that stuffs large documents into context, this overhead adds up fast. I have seen a 7B Q4_K_M model go from 5.2 GB to 6.8 GB RAM usage with a full 8K context.

When CPU Is Enough (And When It Is Not)

After testing across all five providers, here is the framework I use to decide whether a use case needs GPU or whether CPU inference handles it.

CPU inference works when:

Latency tolerance is above 5 seconds. If your user or pipeline can wait for a response, CPU is fine. Batch processing, background jobs, email triage, document summarization — none of these need instant results.
Concurrency is low. One to three simultaneous requests on a 7B model with 8 vCPUs. Beyond that, requests queue and latency spikes.
The model is 13B parameters or smaller. 7B models are the CPU sweet spot. 13B is still workable with 16GB+ RAM and Q4 quantization. 70B on CPU is technically possible but the 0.5-1.0 tok/s generation speed makes it impractical for anything interactive.
You are using ONNX Runtime for non-generative models. BERT classification, sentence embeddings, image classification, structured prediction — ONNX Runtime on CPU handles these with millisecond-level latency. A BERT-base model classifies text in 15-30ms on 4 vCPUs. That is production-fast.
Classical ML is the workload. XGBoost, LightGBM, random forests, scikit-learn — all of these are CPU-native. GPU provides zero benefit for tree-based models. More vCPUs = faster training, linearly.

You need GPU when:

Training neural networks. Backpropagation through millions of parameters is 10-100x faster on GPU. No optimization closes this gap.
High-concurrency LLM serving. 20+ simultaneous users need GPU inference (vLLM, TGI) to maintain acceptable throughput.
30B+ models at interactive speeds. A 70B model at 0.5 tok/s on CPU is a slide show, not a conversation.
Fine-tuning LLMs. Even QLoRA, which reduces memory, is painfully slow on CPU. Budget GPU hours for this.

#1. Kamatera — Scale to 70B Models with Custom RAM Configs

The reason Kamatera sits at the top of this list has nothing to do with raw benchmarks. It is about a specific problem that matters enormously for ML inference: the mismatch between what you need and what fixed plans offer.

Running Llama 2 7B in Q4_K_M needs 6GB RAM but barely taxes the CPU. Running a 70B model in Q4_K_M needs 48GB RAM and as many cores as you can get. Serving an ONNX classification model needs 2GB RAM and 2 vCPUs. These are three completely different hardware profiles, and every fixed-plan provider forces you to buy the wrong shape.

Kamatera lets you build the exact server your model requires. I configured a 64GB RAM / 8 vCPU instance specifically for running a 70B Q4_K_M model (40.5 GB file). It loaded, ran inference at 0.8 tok/s (slow, but functional for batch work), and I did not pay for 48 unused vCPUs that a comparable Hetzner plan would have bundled. When I needed to test a 7B model at maximum speed, I reconfigured to 8GB RAM / 16 vCPUs without redeploying. Hourly billing meant each configuration cost only what I actually used.

The $100 free trial is not a gimmick here. It is genuine runway to benchmark your specific model on different CPU/RAM configurations before committing. I burned through $38 of trial credit testing four different Llama 2 configurations over two days. That kind of testing costs real money elsewhere.

My llama.cpp Benchmark Results

Config	Model	Prompt tok/s	Gen tok/s	RAM Used
8 vCPU / 16GB	7B Q4_K_M	18.2	6.1	5.3 GB
16 vCPU / 16GB	7B Q4_K_M	29.7	9.4	5.3 GB
8 vCPU / 32GB	13B Q4_K_M	11.3	3.2	9.8 GB
16 vCPU / 64GB	70B Q4_K_M	3.1	0.8	43.2 GB

Key Specs

Price

From $4/mo

Max CPU

72 vCPUs

Max RAM

512 GB

SIMD

AVX-512

Why It Works for ML

Custom RAM/CPU ratios let you match the exact shape of your model
72 vCPUs available for maximum parallelism in llama.cpp
512GB RAM ceiling accommodates even 70B Q8_0 models
Hourly billing for testing different configurations
$100 free trial provides real benchmarking runway

Limitations

No GPU instances — CPU inference only
Large configs (64GB+ RAM) get expensive quickly
Shared CPU cores may throttle during sustained inference loads
Complex pricing requires careful calculation before committing

Visit Kamatera — Custom ML Inference VPS from $4/mo →

#2. Hetzner — Dedicated AMD EPYC for Sustained Inference

I discovered something during extended llama.cpp testing that changed my recommendation. On shared CPU providers, a 7B model that starts at 6.1 tok/s gradually drops to 4.2 tok/s over 30 minutes of continuous inference. The hypervisor is stealing cycles to give other tenants their fair share. If you are running a chatbot that handles requests throughout the day, this throttling makes your performance unpredictable and your benchmarks irreproducible.

Hetzner's CCX dedicated CPU line solves this completely. AMD EPYC processors with guaranteed, unshared cores. I ran llama.cpp for 8 straight hours on a CCX23 (4 dedicated vCPUs, 16GB RAM, $15.59/mo) serving a 7B Q4_K_M model. Token generation stayed locked at 4.8 tok/s the entire time. No degradation. No variance. The CPU was mine.

The AMD EPYC processors in the CCX line support AVX-512, which llama.cpp exploits heavily for quantized matrix multiplication. In my testing, the same model on an EPYC with AVX-512 generated 22% faster than on an older Intel Xeon with only AVX2. If you are choosing a VPS specifically for llama.cpp, the CPU microarchitecture matters more than the raw core count.

Hetzner also offers the best per-dollar value for dedicated CPU. The CCX33 (8 dedicated vCPUs, 32GB RAM) at $30.59/mo is my pick for running a 13B Q4_K_M model as a production inference endpoint. The 52K IOPS on NVMe means model loading takes seconds, not minutes — relevant when you are iterating on different GGUF quantizations.

Sustained Inference Stability Test

          Server: Hetzner CCX23 (4 dedicated vCPU, 16GB RAM)

          Model: Llama-2-7B-Chat Q4_K_M (4.08 GB GGUF)

          Test: Continuous inference, 500-token prompts, 200-token completions

          Duration: 8 hours

          Hour 1: 4.81 tok/s  |  Hour 2: 4.79 tok/s  |  Hour 3: 4.82 tok/s

          Hour 4: 4.80 tok/s  |  Hour 5: 4.78 tok/s  |  Hour 6: 4.81 tok/s

          Hour 7: 4.79 tok/s  |  Hour 8: 4.80 tok/s

          Variance: <1%. Zero throttling detected.

Key Specs

CCX Price

$7.49/mo+

CPU

AMD EPYC

Disk I/O

52K IOPS

SIMD

AVX-512

Why It Works for ML

Dedicated CPU cores — no throttling during sustained inference
AMD EPYC with AVX-512 for optimized quantized inference
22% faster llama.cpp than equivalent AVX2-only CPUs
52K IOPS NVMe for fast model loading and dataset I/O
Hourly billing with full API automation

Limitations

No GPU instances — training neural networks is not viable
CCX entry plan (2 vCPUs) is too small for LLM inference
Only 1 US datacenter (Ashburn, VA)
Dedicated plans cost 40-60% more than shared equivalents

Visit Hetzner — Dedicated CPU ML VPS from $7.49/mo →

#3. DigitalOcean — The On-Ramp When CPU Is No Longer Enough

At some point, you will hit the ceiling of CPU inference. Maybe your chatbot got popular and now serves 40 concurrent users. Maybe you need to fine-tune a model and QLoRA on CPU takes 3 days instead of 45 minutes. Maybe you want to run a 30B model at conversational speed. When that happens, you need GPU — and DigitalOcean makes the transition the least painful.

I say "least painful" because GPU computing is operationally complex. CUDA driver versions, cuDNN compatibility, PyTorch build variants — getting a GPU environment working from scratch costs hours. DigitalOcean's GPU Droplets ship with CUDA pre-installed, PyTorch and TensorFlow ready, and JupyterLab one click away. For someone who has been running llama.cpp on CPU and needs to step up to vLLM or text-generation-inference on GPU, DigitalOcean removes the DevOps friction.

Their CPU-only Droplets are also solid for inference. The 980 Mbps network throughput matters when you are downloading multi-gigabyte GGUF files or building a Python-based pipeline that pulls embeddings from a remote model server. The one-click Jupyter deployment is genuinely useful for ML experimentation — install llama-cpp-python, load a model, and you are running inference interactively in minutes.

Where DigitalOcean loses points for budget ML: their RAM pricing. 8GB costs $48/mo on a Premium CPU Droplet. Contabo gives you 16GB for $14.99. If all you need is RAM to hold a model, DigitalOcean is 6x more expensive per gigabyte. You are paying for the ecosystem, the GPU upgrade path, and the operational simplicity.

Key Specs

CPU Price

$6.00/mo+

GPU

NVIDIA H100

Network

980 Mbps

Disk I/O

55K IOPS

Why It Works for ML

GPU Droplets with pre-installed CUDA, PyTorch, TensorFlow
One-click JupyterLab for interactive ML experimentation
Smooth upgrade path from CPU Droplets to GPU Droplets
980 Mbps network for fast model and dataset downloads
Excellent Python and ML documentation

Limitations

CPU Droplet RAM pricing is 3-6x more expensive than Contabo
GPU Droplets are premium priced (~$2.50/hr+)
1TB bandwidth on entry plans limits large dataset transfers
CPU benchmark (4000) is mid-range — not the fastest for inference

Visit DigitalOcean — GPU Droplets for ML →

#4. Vultr — A100 GPUs When Training Is the Actual Job

Everything in this article so far has been about inference — using trained models. But some of you actually need to train. You have a custom dataset. You need to fine-tune Llama 2 for your domain. You are training a CNN for image classification. You are doing reinforcement learning. For those use cases, you need real GPU hardware, and Vultr offers the best combination of power and flexibility.

NVIDIA A100 with 80GB HBM2e memory. That is enough to fine-tune a 13B parameter model with QLoRA in a single GPU, or train a custom vision model from scratch on a serious dataset. More importantly, Vultr bills by the hour. The workflow I recommend to everyone:

Develop and preprocess on a cheap CPU VPS ($5-10/mo). Write your training code, clean your data, build your pipeline.
Spin up a GPU instance only when you are ready to train. An A100 at ~$2.90/hr for a 6-hour fine-tuning run costs $17.40.
Download your trained model, shut down the GPU instance, and deploy inference on a CPU VPS.

This CPU-for-dev, GPU-for-training, CPU-for-inference pattern is how cost-conscious ML teams actually operate. Vultr's private networking between instances keeps data transfer fast and free. Their 9 US datacenter locations mean you can place your GPU instance close to your data source.

For CPU-only inference, Vultr's regular compute instances (4100 benchmark, 50K IOPS) are middle-of-the-road. Decent but not exceptional. If you are only doing CPU inference and never plan to train, Hetzner or Contabo are better values. Vultr's strength is the complete pipeline — develop, train, deploy — all on one platform.

Key Specs

CPU Price

$5.00/mo+

GPU

A100 / A40

GPU VRAM

Up to 80GB

Billing

Hourly

Why It Works for ML

NVIDIA A100 (80GB) for serious fine-tuning and training
Hourly billing — pay only for actual training time
Private networking between CPU dev and GPU training instances
9 US datacenter locations for low-latency data access
Snapshots to save configured ML environments between sessions

Limitations

GPU instances are premium priced (~$2.90/hr for A100)
GPU availability varies by location — not always instant
CPU-only performance (4100 benchmark) is mid-range for inference
2TB bandwidth on entry compute plans

Visit Vultr — A100 GPU Instances for ML Training →

#5. Contabo — 16GB RAM at $14.99 for Budget 7B-13B Inference

This is the provider from my opening experiment. $14.99/mo. 16GB RAM. 6 vCPUs. A Llama 2 7B model in Q4_K_M loaded, ran inference, and generated 3.8 tokens per second. For the price of a large pizza, I was running a private LLM.

Contabo exists for a specific moment in the ML journey: you want to run a quantized model, you do not need screaming-fast inference, and you refuse to pay $48/mo to DigitalOcean for 8GB of RAM when you need 16GB. The math is almost absurd. Contabo's 16GB plan costs less than DigitalOcean's 4GB plan. For holding a model in memory — which is literally the primary requirement for LLM inference — Contabo offers the most RAM per dollar by a wide margin.

The tradeoffs are real, though. CPU benchmark (3200) is the lowest here, and those are shared cores. During my 8-hour sustained inference test, generation speed dropped from 3.8 tok/s to 2.9 tok/s as other tenants loaded the host. The 25K IOPS means loading a 7.87 GB 13B model from disk takes noticeably longer than on NVMe providers. If you are frequently swapping between models, that lag adds up.

The 32TB bandwidth is generous for downloading GGUF files from Hugging Face. The 200GB storage can hold several quantized models simultaneously. And the $29.99/mo plan with 30GB RAM and 8 vCPUs is enough to run a 13B Q5_K_M model (9.23 GB) with room to breathe — still cheaper than most providers' 8GB plans.

Use Contabo for: personal LLM chatbots, batch text processing, overnight inference jobs, learning and experimentation, any use case where the cheapest path to "model loaded in RAM" matters more than throughput.

Budget Inference Benchmarks

          Plan: VPS 2 — 6 vCPU, 16GB RAM, $14.99/mo

          7B Q4_K_M: 3.8 tok/s gen (fresh) → 2.9 tok/s (sustained 8hr)

          7B Q5_K_M: 3.4 tok/s gen (fresh) → 2.6 tok/s (sustained 8hr)

          13B Q4_K_M: 1.9 tok/s gen (fresh) → 1.4 tok/s (sustained 8hr)

          BERT classification (ONNX): 28ms/request (stable)

          Note: Shared CPU throttling is real. Plan for the sustained number, not peak.

Key Specs

Price

$6.99/mo+

16GB Plan

$14.99/mo

vCPUs

4-8 Cores

Bandwidth

32 TB

Why It Works for ML

16GB RAM at $14.99/mo — cheapest path to running 7B-13B models
30GB RAM at $29.99/mo for 13B Q5_K_M or Q8_0
32TB bandwidth for downloading models from Hugging Face
200GB storage holds multiple GGUF files simultaneously
Root access for installing llama.cpp, ONNX Runtime, any framework

Limitations

CPU benchmark (3200) is the lowest — slowest inference per core
Shared CPU throttles 20-25% under sustained load
25K disk IOPS makes model loading slower than NVMe providers
No GPU instances — CPU inference only
No hourly billing — monthly commitment only

Visit Contabo — 16GB RAM ML VPS at $14.99/mo →

ML Inference VPS Comparison Table

Provider	Price/mo	7B Q4 tok/s	Max RAM	AVX-512	GPU	Dedicated CPU	Hourly
Kamatera	$4.00	6.1 (8 vCPU)	512 GB	✓	✗	✓	✓
Hetzner	$7.49	4.8 (4 ded.)	64 GB	✓	✗	✓	✓
DigitalOcean	$6.00	5.2 (4 vCPU)	256 GB	Varies	✓	✓	✓
Vultr	$5.00	5.5 (4 vCPU)	96 GB	Varies	✓	✓	✓
Contabo	$6.99	3.8 (6 vCPU)	60 GB	Varies	✗	✗	✗

RAM Requirements by Model Size

This is the table I wish someone had given me before I started. How much RAM you actually need for each model size at each quantization level, including OS overhead and context window buffer.

Model	Q4_K_M File	Min RAM	Comfortable RAM	Best Provider Pick
TinyLlama 1.1B	0.64 GB	2 GB	4 GB	Any $4-5/mo plan
Phi-2 2.7B	1.52 GB	4 GB	8 GB	Contabo $6.99 (8GB)
Llama 2 7B	4.08 GB	6 GB	8-16 GB	Contabo $14.99 (16GB)
Llama 2 13B	7.87 GB	12 GB	16-32 GB	Hetzner CCX33 $30.59 (32GB)
Mixtral 8x7B	26.4 GB	32 GB	48-64 GB	Kamatera custom 64GB
Llama 2 70B	40.5 GB	48 GB	64 GB	Kamatera custom 64GB+
BERT-base (ONNX)	0.44 GB	1 GB	2 GB	Any $4-5/mo plan
Sentence-BERT (ONNX)	0.09 GB	1 GB	2 GB	Any $4-5/mo plan

"Comfortable RAM" means enough for the model, a 4K context window, OS overhead, and a Python process or two. "Min RAM" means it loads and runs but any additional memory pressure causes problems. Always provision for comfortable.

ONNX Runtime: The Other Half of CPU Inference

llama.cpp gets all the attention because LLMs are flashy. But most production ML inference is not LLMs. It is classification, regression, embeddings, image processing, and structured prediction. For these workloads, ONNX Runtime is the tool.

ONNX (Open Neural Network Exchange) is a format that lets you export models from PyTorch, TensorFlow, scikit-learn, XGBoost, and LightGBM into a single optimized representation. ONNX Runtime then runs those models with hardware-specific optimizations — including AVX2/AVX-512 on CPU — without requiring the original training framework to be installed.

Why this matters for VPS deployment:

No PyTorch/TensorFlow dependency. A PyTorch installation adds 2-3 GB to your server. ONNX Runtime is 50 MB. For a $4/mo VPS with 20GB storage, that difference is material.
Faster inference. ONNX Runtime with graph optimization runs 30-50% faster than raw PyTorch on CPU for most models. It fuses operations, optimizes memory layout, and uses SIMD instructions automatically.
Millisecond latency. A BERT-base model classifies text in 15-30ms on 4 vCPUs. Sentence embeddings with all-MiniLM-L6-v2 take 5-10ms. These are production-ready numbers for API endpoints.
Tiny memory footprint. BERT-base in ONNX is 440 MB. With dynamic batching, a single model instance serves thousands of requests per second on 2GB RAM.

The practical setup: export your model with torch.onnx.export() or skl2onnx, upload the .onnx file to your VPS, and serve it with ONNX Runtime behind FastAPI or Flask. No CUDA. No GPU driver headaches. A $5/mo VPS running an ONNX classification model can handle more requests per second than most applications will ever generate.

For embedding-based search (RAG pipelines), ONNX Runtime running all-MiniLM-L6-v2 generates embeddings at 200-400 vectors/second on a 4-vCPU VPS. Pair that with a PostgreSQL instance running pgvector and you have a complete semantic search pipeline for under $15/mo.

Self-Hosting vs API: The Break-Even Math

Not every ML workload should be self-hosted on a VPS. Sometimes the OpenAI API or Anthropic API is cheaper. Here is the honest math.

API pricing (March 2026 approximate):

GPT-4o-mini: $0.15/1M input tokens, $0.60/1M output tokens
Claude 3.5 Haiku: $0.25/1M input, $1.25/1M output
GPT-4o: $2.50/1M input, $10.00/1M output

Self-hosted costs (Contabo 16GB = $14.99/mo):

A 7B Q4_K_M model at 3.8 tok/s generates ~9.9 million tokens/month running 24/7
Cost per million tokens: $1.52 (all-in hosting cost divided by output)
But utilization is rarely 100%. At 20% utilization (realistic for most projects): $7.58 per million tokens

Break-even: Self-hosting a 7B model is cheaper than GPT-4o-mini only if you process more than ~5 million tokens/month consistently. Below that volume, just use an API. The model quality difference also matters — GPT-4o-mini often outperforms open 7B models on complex reasoning. Self-hosting makes sense when: you need data privacy (nothing leaves your server), you need customization (fine-tuned models), or your volume is high enough to justify the fixed cost.

For small business use cases with modest token volumes, APIs are almost always more cost-effective. Self-hosting becomes compelling at scale or when privacy requirements make API calls unacceptable.

Frequently Asked Questions

Can I run LLMs on a VPS without a GPU?

Yes — with quantized models and llama.cpp. A 7B parameter model quantized to Q4_K_M (4.08 GB) runs at 3-5 tokens/second on 8GB RAM VPS with 4 vCPUs. That is usable for chatbot backends, batch text processing, and API endpoints that do not need instant responses. For real-time streaming chat, you will want 8+ vCPUs to reach 8-12 tok/s. Models above 13B parameters need 16-32GB RAM at Q4 quantization.

How much RAM do I need for llama.cpp inference?

The model file size plus 500MB-1GB overhead. Q4_K_M quantization: 7B model = 4.08 GB (need 6GB+ RAM), 13B model = 7.87 GB (need 10GB+ RAM), 70B model = 40.5 GB (need 48GB+ RAM). Q8_0 quantization roughly doubles these numbers. Always leave headroom for the OS and context window — a 4096-token context adds ~200MB for 7B models. Check the RAM requirements table above for specific recommendations by model.

What is GGUF quantization and which level should I use?

GGUF is the file format llama.cpp uses for quantized models. Quantization reduces model precision to shrink file size and RAM usage. Q4_K_M is the sweet spot — it reduces a 7B model from 13.5 GB (FP16) to 4.08 GB with minimal quality loss on practical tasks. Q5_K_M (4.78 GB) is slightly better quality if RAM allows. Q2_K (2.83 GB) fits tighter RAM but noticeably degrades output. Q8_0 (7.16 GB) is near-lossless but needs twice the RAM.

Is CPU inference fast enough for a production API?

For single-user or low-concurrency APIs, yes. A 7B Q4_K_M model on 8 vCPUs generates 8-12 tokens/second — a 200-word response takes 10-15 seconds. For a personal chatbot, internal tool, or batch processing pipeline, that is perfectly fine. For a public-facing API serving 50+ concurrent users who expect sub-second streaming, you need GPU inference. The break-even is roughly 5-10 concurrent users on a 16-vCPU server.

Should I use llama.cpp or ONNX Runtime for CPU inference?

llama.cpp for LLMs (text generation), ONNX Runtime for everything else (classification, embeddings, image models, structured prediction). llama.cpp is specifically optimized for autoregressive transformer inference on CPU with AVX2/AVX-512 and quantization support. ONNX Runtime is a general-purpose inference engine that handles any model exported from PyTorch, TensorFlow, or scikit-learn. Many production setups use both.

How much does it cost to run a private LLM on a VPS?

A 7B model on Contabo: $14.99/mo (16GB RAM, 6 vCPUs). A 13B model on Hetzner CCX: ~$35/mo (16GB dedicated RAM). A 70B model on Kamatera: ~$180/mo (64GB RAM, 16 vCPUs). Compare to API costs: if you process more than ~5-10 million tokens/month, self-hosting is cheaper. Below that, just use the OpenAI or Anthropic API.

Can I fine-tune or train models on a CPU VPS?

Classical ML training (scikit-learn, XGBoost, LightGBM) — absolutely, CPU is the right tool. XGBoost scales linearly with more cores. Fine-tuning LLMs — technically possible with QLoRA on CPU but painfully slow. Full neural network training — no. If your workflow involves PyTorch's loss.backward() on anything beyond a toy dataset, you need GPU instances from Vultr or DigitalOcean.

Does AVX-512 matter for ML inference speed?

Yes, significantly. llama.cpp uses SIMD instructions (AVX2, AVX-512) for matrix multiplication in quantized inference. AVX-512 provides 15-30% speedup over AVX2 for Q4 quantized models. Hetzner's AMD EPYC and Kamatera's Intel Xeon processors support AVX-512. Check with lscpu | grep avx after provisioning. Most modern VPS CPUs support AVX2 at minimum, but older nodes may only have AVX — verify before committing to a long-term plan.

What is the difference between training and inference in ML?

Training teaches the model — feeding data and adjusting millions of parameters through backpropagation. It is compute-intensive, GPU-hungry, and you do it once or occasionally. Inference uses the trained model to make predictions on new data. No backpropagation, no gradient computation. A model that took 100 GPU-hours to train can serve predictions on a $10 CPU VPS. Most VPS ML use cases are inference: chatbots, classifiers, embedding generators, prediction APIs.

My Recommendations by Use Case

Budget 7B inference: Contabo at $14.99/mo (16GB RAM). Reliable sustained inference: Hetzner CCX with dedicated AMD EPYC cores. Scaling to 70B models: Kamatera with custom 64GB+ configs. Need GPU for training: Vultr A100 with hourly billing.

Visit Contabo → Visit Hetzner →

Best VPS for Machine Learning in 2026 — CPU Inference Tested on 5 Providers

Quick Answer: Best VPS for ML Inference

Table of Contents

The $14 VPS Experiment: What Actually Happened

Training vs Inference: The Distinction That Saves You Money

GGUF Quantization Levels: The RAM vs Quality Tradeoff

When CPU Is Enough (And When It Is Not)

CPU inference works when:

You need GPU when:

#1. Kamatera — Scale to 70B Models with Custom RAM Configs

My llama.cpp Benchmark Results

Key Specs

Why It Works for ML

Limitations

#2. Hetzner — Dedicated AMD EPYC for Sustained Inference

Sustained Inference Stability Test

Key Specs

Why It Works for ML

Limitations

#3. DigitalOcean — The On-Ramp When CPU Is No Longer Enough

Key Specs

Why It Works for ML

Limitations

#4. Vultr — A100 GPUs When Training Is the Actual Job

Key Specs

Why It Works for ML

Limitations

#5. Contabo — 16GB RAM at $14.99 for Budget 7B-13B Inference

Budget Inference Benchmarks

Key Specs

Why It Works for ML

Limitations

ML Inference VPS Comparison Table

RAM Requirements by Model Size

ONNX Runtime: The Other Half of CPU Inference

Self-Hosting vs API: The Break-Even Math

Frequently Asked Questions

My Recommendations by Use Case