RAM Aggregator
The decentralized AI inference network powered by Apple Silicon.
RAM Aggregator is a peer-to-peer network that connects people who want to use AI (users) with people who have spare compute (providers). Users pay with xRAM tokens to run LLM inference; providers earn xRAM by sharing their Mac's idle processing power.
The entire system runs on three pillars:
- Coordinator — The central routing layer that matches jobs to providers, manages the token ledger, and exposes an OpenAI-compatible API. Includes a shard scheduler for automatically splitting large models across multiple Macs.
- Provider Nodes — macOS menubar apps that connect to the coordinator via WebSocket, download models from HuggingFace, and run inference locally using Apple's MLX framework. Providers can serve full models or individual shards as part of a distributed pipeline.
- xRAM Token — An ERC-20 token on Base (Coinbase's L2) that powers all payments and rewards. 100M fixed supply with a Bitcoin-style halving emission schedule.
Highlights
- Pipeline parallelism — Models too large for a single Mac are automatically split across multiple providers and run as a distributed pipeline.
- Smart model routing — The coordinator intelligently selects providers that already have models loaded, falling back to lazy-loading when necessary.
- Native chat templates — Each model receives its native chat format (Llama, Qwen, Mistral) for accurate, clean responses with no prompt leakage.
- Non-blocking inference — MLX inference runs in a dedicated thread, keeping WebSocket connections alive during long generation runs (40s+).
- Provider controls — RAM allocation slider and Prevent Sleep toggle let providers manage exactly how much of their machine to share.
- Auto-update system — Providers receive over-the-air updates pushed from the coordinator, keeping the fleet in sync.
Quickstart: Users
Start chatting with AI in under a minute.
Option 1: Free Demo
Visit the chat app and start typing. The app uses a shared demo key (xram_free_test) that gives you a handful of free messages per day to try the network.
Option 2: Connect MetaMask (Unlimited)
- Open the chat app and click Connect Wallet.
- Switch MetaMask to the Base network (the app will prompt you).
- Click Deposit xRAM and choose an amount. Your xRAM tokens are transferred to the escrow smart contract.
- A session token is issued automatically. You can now chat with any model, paying per token from your deposit.
- When you're done, click Withdraw to reclaim your unused balance back to your wallet.
Option 3: OpenAI SDK (for Developers)
Point any OpenAI-compatible client at the RAM Aggregator API:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ramaggregator.com/v1",
    api_key="your_session_token_here",  # sess_... from deposit
)

response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Quickstart: Providers
Turn your Mac into an AI inference node and start earning xRAM tokens.
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- At least 8GB RAM (more RAM = more models you can serve)
- Stable internet connection
- An Ethereum wallet address (for receiving xRAM rewards)
Installation
- Download the RAM Aggregator macOS app (DMG installer).
- Drag it to Applications and launch. A chip icon appears in your menu bar.
- The app creates a Python virtual environment and installs dependencies automatically.
Configuration
- Click the menubar icon and select Rename Node to set a unique name for your provider.
- Open the Models submenu and click Enable on any models you want to serve.
- Models that exceed your system RAM are automatically greyed out and non-selectable.
- The model will download from HuggingFace (one-time) and then show as [ready].
- Use the RAM Allocation slider to control how much of your system memory is available for models (e.g., cap at 16GB if you want to keep resources free for other work).
- Enable Prevent Sleep if you want your Mac to stay awake and serve jobs while unattended.
Running
Click Start Daemon from the menubar. Your node connects to the coordinator, reports its available models, and begins accepting inference jobs. The status indicator turns green when you're online.
You can enable multiple models simultaneously (RAM permitting). The coordinator routes jobs to you based on which models you have loaded, your available memory, and your reputation score.
If the coordinator requests it, your node can also serve as a shard worker for pipeline parallelism — loading a portion of a large model and processing hidden states as part of a distributed pipeline. This happens automatically; no extra configuration needed.
Earning
Every time you complete an inference job, xRAM tokens are minted on-chain to your wallet address. The amount depends on the current emission epoch, the number of tokens generated, and your staking multiplier. You can check your earnings from the menubar's dashboard link.
Connection Resilience
The provider daemon automatically reconnects if the WebSocket connection drops. On reconnection, stale registrations are cleaned up so the coordinator always shows an accurate count of active providers. Inference runs in a dedicated thread so that long generation jobs (40+ seconds for large models) don't block WebSocket keepalive pings.
Uninstalling Models
To remove a model: open the Models submenu, expand the model, and click Uninstall (delete files). This removes the model from disk, HuggingFace cache, and memory.
System Architecture
RAM Aggregator is a hub-and-spoke architecture with a central coordinator and distributed provider nodes. It supports two inference modes: single-provider (for models that fit in one Mac's memory) and pipeline-parallel (for models that need to be split across multiple Macs).
```
[Users / AI Agents]
        |
        | HTTPS (OpenAI-compatible API)
        v
[Coordinator + Shard Scheduler] <---> [Base L2 Smart Contract]
        |                                  (xRAM ERC-20)
        | WebSocket
        |
   ┌────┴─────────────────────────────────────┐
   |                                          |
   |  Single Provider                         |  Pipeline Parallelism
   |  (fits in 1 Mac)                         |  (split model)
   v                                          v
[Provider 1]                 [Shard 0] ──TCP──> [Shard 1] ──TCP──> [Shard 2]
 M3 Max, 64GB                 Mac #1             Mac #2             Mac #3
 Qwen3 8B (full)              Layers 0-10        Layers 11-21       Layers 22-32
```
Coordinator
The coordinator is a FastAPI server deployed on Fly.io. It handles:
- Job routing and dispatch — Matches requests to the best provider, preferring those with models already loaded in memory.
- Shard scheduling — Automatically splits large models across multiple providers when no single node has enough RAM.
- Token ledger and emission — Manages the xRAM reward schedule and on-chain minting.
- Provider registry — Tracks heartbeats, deduplicates reconnections, and maintains accurate provider counts.
- OpenAI-compatible gateway — Exposes `/v1/chat/completions` with native chat template formatting for each model family.
- E2E encryption relay — Forwards encrypted blobs without decryption for private inference.
Provider Nodes
Each provider runs a macOS menubar app that maintains a persistent WebSocket connection to the coordinator. Providers report their hardware specs, loaded models, and availability via heartbeats every 10 seconds.
When a job arrives, the provider runs inference using Apple's MLX framework. Inference executes in a dedicated thread (asyncio.to_thread) so the asyncio event loop stays responsive for WebSocket pings and heartbeats, even during long generation runs. Prompts are formatted using each model's native chat template (e.g., <|im_start|> for Qwen3, <|begin_of_text|> for Llama) to ensure clean, accurate output.
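The non-blocking pattern described above can be sketched with `asyncio.to_thread`. This is a minimal illustration, not the daemon's actual code: `run_inference` is a hypothetical stand-in for a blocking MLX generate call, and `keepalive` stands in for WebSocket pings.

```python
import asyncio
import time


def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for a blocking MLX generation call.
    time.sleep(0.2)
    return f"echo: {prompt}"


async def keepalive(events: list) -> None:
    # Fires while inference is still running, showing the loop isn't blocked.
    await asyncio.sleep(0.05)
    events.append("ping")


async def handle_job(prompt: str, events: list) -> str:
    result, _ = await asyncio.gather(
        asyncio.to_thread(run_inference, prompt),  # blocking work in a worker thread
        keepalive(events),
    )
    events.append("done")
    return result


events: list = []
answer = asyncio.run(handle_job("hello", events))
```

Because the blocking call runs in a worker thread, the event loop services the keepalive coroutine mid-generation, which is exactly why long jobs don't drop the WebSocket connection.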
Providers can also act as shard workers for pipeline parallelism. When the coordinator assigns a shard, the provider loads only the specified transformer layers and listens on a TCP port for hidden state data from the previous shard in the pipeline.
On-Chain Layer
The xRAM ERC-20 token is deployed on Base mainnet. The coordinator holds a signer key that can mint tokens from the treasury allocation. User deposits are handled via an escrow wallet with approve/transferFrom for withdrawals.
Job Flow
Here's what happens when a user sends a chat message:
- Request — User sends a chat message via the web UI or OpenAI API. The gateway validates their API key or session token.
- Model Resolution — The gateway resolves the model alias (e.g., `qwen3-8b`) to the full HuggingFace ID and formats the prompt using the model's native chat template.
- Routing Decision — The coordinator checks whether any single provider can serve the model. If not, it checks for an existing pipeline shard group or triggers automatic shard scheduling.
- Dispatch (Single Provider) — For models that fit in one Mac, the dispatcher selects the best provider: it prefers those with the model already loaded, then falls back to any online provider with enough RAM.
- Dispatch (Pipeline) — For models split across multiple Macs, the coordinator routes the request to the pipeline orchestrator, which sends hidden states through each shard worker in sequence.
- Inference — The provider runs inference through MLX in a dedicated thread (keeping the WebSocket alive), applies the model's native chat template, strips any internal reasoning tags, and returns the result.
- Validation — The coordinator validates the result (output quality, timing plausibility, self-dealing, duplicate detection) before paying the reward.
- Payment — If validation passes, xRAM tokens are minted on-chain to the provider's wallet. The user's session deposit is deducted based on dynamic pricing.
Distributed Inference
Some models are too large for any single Mac. A 405B-parameter model needs roughly 250 GB of RAM — more than most machines have. RAM Aggregator solves this with pipeline parallelism: the model's transformer layers are split into shards, each shard runs on a different Mac, and hidden states flow through the pipeline via TCP.
How Pipeline Parallelism Works
A transformer model is a stack of identical layers. If a model has 60 layers and three Macs are available, the coordinator splits it into three shards:
```
Mac #1 (Shard 0): Layers 0–19  + Embedding layer
Mac #2 (Shard 1): Layers 20–39
Mac #3 (Shard 2): Layers 40–59 + Output head (logits)
```

Flow for each token:

```
[Prompt] → Embed → Shard 0 → TCP → Shard 1 → TCP → Shard 2 → Logits → [Token]
```
Each shard worker loads only its assigned layers into memory. The orchestrator (running on the coordinator or the first shard) manages the autoregressive generation loop: it tokenizes the input, computes embeddings, sends hidden states through the pipeline, collects logits from the final shard, samples the next token, and repeats.
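The layer split above can be computed with a few lines. This is a sketch of the arithmetic, not the scheduler's actual code; early shards absorb any remainder when the layer count doesn't divide evenly.

```python
def split_layers(num_layers: int, num_shards: int):
    """Divide transformer layers into contiguous, near-equal shards."""
    base, extra = divmod(num_layers, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # early shards take the remainder
        shards.append(range(start, start + size))
        start += size
    return shards
```

For the 60-layer example, `split_layers(60, 3)` yields layers 0–19, 20–39, and 40–59, matching the shard layout shown above.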
Automatic Shard Scheduling
When a user requests a model that no single provider can serve, the coordinator's shard scheduler takes over:
- It queries the model's layer count and estimates RAM per shard.
- It finds online providers with enough free memory to hold at least one shard.
- It assigns shards to providers and sends `LOAD_SHARD` commands via WebSocket.
- Each provider downloads the model (if needed), loads its assigned layers, and starts a TCP shard worker.
- When all shards report `SHARD_READY`, the pipeline is marked as complete and ready for inference.
This entire process is transparent to the user. They request llama-3.1-405b and get a response — they don't need to know it was split across four Macs.
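The scheduling steps above can be sketched as a greedy capacity check. This is an illustrative heuristic under assumed field names (`online`, `available_memory`, `id`); the real scheduler also accounts for layer counts and per-shard RAM estimates.

```python
def plan_shards(model_ram_gb: float, providers: list):
    """Greedy sketch: largest-free-memory providers first, one shard each,
    until the model's total RAM requirement is covered."""
    candidates = sorted(
        (p for p in providers if p["online"]),
        key=lambda p: p["available_memory"],
        reverse=True,
    )
    plan, remaining = [], model_ram_gb
    for shard_id, p in enumerate(candidates):
        if remaining <= 0:
            break
        share = min(p["available_memory"], remaining)
        plan.append({"shard": shard_id, "provider": p["id"], "ram_gb": share})
        remaining -= share
    return plan if remaining <= 0 else None  # None: not enough network capacity
```

A `None` result corresponds to the case where the request cannot be served at all, even with pipeline parallelism.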
Shard Worker Architecture
Each shard worker is a lightweight TCP server that handles three message types:
- FORWARD_PASS — Receives serialized hidden states, runs them through its transformer layers, and returns the output. If it's the last shard, it applies the final layer norm and output projection to produce logits.
- KV_CACHE_INIT — Initializes the key-value cache for efficient autoregressive generation.
- HEARTBEAT — Confirms the shard is alive and responsive.
Hidden states are serialized as compact binary arrays (MLX arrays → numpy → bytes) with shape/dtype metadata. The protocol uses length-prefixed JSON headers for routing and raw binary payloads for tensor data.
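The length-prefixed framing can be sketched as follows. Function names are illustrative; the shape/dtype metadata rides in the JSON header exactly as described above, with the raw tensor bytes appended after it.

```python
import json
import struct

import numpy as np


def pack_message(header: dict, tensor: np.ndarray) -> bytes:
    """Length-prefixed JSON header followed by raw tensor bytes."""
    header = {**header, "shape": tensor.shape, "dtype": str(tensor.dtype)}
    header_bytes = json.dumps(header).encode()
    return struct.pack(">I", len(header_bytes)) + header_bytes + tensor.tobytes()


def unpack_message(data: bytes):
    """Inverse of pack_message: recover the header dict and the tensor."""
    (header_len,) = struct.unpack(">I", data[:4])
    header = json.loads(data[4 : 4 + header_len])
    tensor = np.frombuffer(data[4 + header_len :], dtype=header["dtype"])
    return header, tensor.reshape(header["shape"])
```

In the real pipeline the MLX array is converted to numpy before packing; the wire format itself is independent of the tensor library.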
Smart Routing
The coordinator makes intelligent decisions about where to send each inference request. Here's the priority order:
Provider Selection
- Model already loaded — The coordinator prefers providers that have the requested model loaded and ready in memory. This avoids model loading time (which can be 10–60 seconds for larger models).
- Preferred provider — If the client specifies a `preferred_provider` ID (e.g., for testing or affinity), the coordinator routes to that provider if it's online.
- Best available — Among providers with the right model, the coordinator picks the one with the highest `available_memory × reputation` score.
- Lazy-load fallback — If no provider has the model loaded, any online provider with enough RAM is selected. The model will be downloaded and loaded on first use.
- Pipeline fallback — If no single provider has enough RAM, the shard scheduler splits the model across multiple providers.
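The priority order above can be sketched in a few lines. Field names and the exact precedence between the preferred-provider override and the warm-provider check are assumptions for illustration.

```python
def select_provider(providers, model, preferred=None, required_ram=0):
    """Sketch of the routing priority: preferred override, then warm
    providers ranked by memory x reputation, then lazy-load fallback."""
    online = [p for p in providers if p["online"]]
    if preferred is not None:
        for p in online:
            if p["id"] == preferred:
                return p

    def score(p):
        return p["available_memory"] * p["reputation"]

    warm = [p for p in online if model in p["loaded_models"]]
    if warm:  # model already in memory: no load time
        return max(warm, key=score)
    cold = [p for p in online if p["available_memory"] >= required_ram]
    if cold:  # lazy-load fallback
        return max(cold, key=score)
    return None  # caller escalates to the shard scheduler
```

A `None` return corresponds to the pipeline fallback: no single node fits the model, so the shard scheduler takes over.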
Chat Template Formatting
Different model families require different prompt formats. The coordinator sends prompts in a generic System: ... / User: ... / Assistant: format, and the provider's inference engine applies the correct chat template using the tokenizer's built-in apply_chat_template function. This ensures each model sees its native token format:
| Model Family | Template Style | Special Handling |
|---|---|---|
| Llama 3.x | <|begin_of_text|> + role headers | Standard instruct format |
| Qwen 2.5 / 3 | <|im_start|> ChatML-style | Qwen3: /no_think directive to suppress chain-of-thought |
| Mistral | [INST] markers | Standard instruct format |
| DeepSeek | ChatML-style | <think> tags stripped from output |
If a model includes internal reasoning in <think>...</think> tags, the provider automatically strips these before returning the response, so users see only the final answer.
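The tag-stripping step can be sketched with a regular expression. This is a minimal illustration, assuming well-formed, non-nested `<think>` blocks as emitted by these model families.

```python
import re


def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> reasoning blocks before returning output."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
```

The non-greedy match plus `DOTALL` handles multi-line reasoning blocks; text without tags passes through unchanged.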
Model Catalog
All models are 4-bit quantized MLX versions from the mlx-community HuggingFace organization.
Edge & Small (8–16 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Llama 3.2 1B Instruct | 1B | ~1.5 GB | ~150 tok/s | Simple tasks, fast replies |
| Llama 3.2 3B Instruct | 3B | ~3 GB | ~100 tok/s | Balanced speed & quality |
| Mistral 7B Instruct v0.3 | 7B | ~5.5 GB | ~60 tok/s | Reasoning, instruction following |
| Qwen 2.5 7B Instruct | 7B | ~5.5 GB | ~60 tok/s | Multilingual, coding |
| Qwen3 8B | 8B | ~5.5 GB | ~55 tok/s | Latest gen, built-in thinking mode |
| Llama 3.1 8B Instruct | 8B | ~6 GB | ~50 tok/s | General purpose |
Medium (16–64 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen3 30B MoE NEW | 30B (3B active) | ~18 GB | ~90 tok/s | Smart like 30B, fast like 3B |
| Qwen3 Coder 30B MoE NEW | 30B (3B active) | ~18 GB | ~90 tok/s | Sonnet-class agentic coding |
| Qwen 2.5 Coder 32B Instruct | 32B | ~20 GB | ~25 tok/s | Code generation, 80+ languages |
| DeepSeek R1 Distill 32B | 32B | ~20 GB | ~25 tok/s | Chain-of-thought reasoning |
| Llama 3.3 70B Instruct | 70B | ~42 GB | ~12 tok/s | Latest Llama, best quality/size |
| Llama 3.1 70B Instruct | 70B | ~42 GB | ~10 tok/s | Proven workhorse |
| Qwen 2.5 72B Instruct | 72B | ~45 GB | ~10 tok/s | GPT-4 class multilingual |
Large (96–192 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Mistral Large 2 | 123B | ~75 GB | ~6 tok/s | Mistral flagship, complex tasks |
| Qwen3 235B MoE NEW | 235B (22B active) | ~135 GB | ~30 tok/s | Thinking mode, 119 languages |
XL (192–384 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen3.5 397B MoE NEW | 397B (17B active) | ~225 GB | ~35 tok/s | Latest Qwen flagship, hybrid DeltaNet, 201 languages |
| Llama 3.1 405B Instruct | 405B | ~250 GB | ~3 tok/s | Largest open dense model |
| Qwen3 Coder 480B MoE NEW | 480B (35B active) | ~280 GB | ~18 tok/s | Ultimate coding, frontier-class |
Ultra (400+ GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| DeepSeek R1 671B | 671B MoE | ~400 GB | ~5 tok/s | Ultimate reasoning (37B active) |
| DeepSeek V3 671B | 671B MoE | ~400 GB | ~5 tok/s | Flagship general-purpose MoE |
Provider Menubar App
The RAM Aggregator provider app lives in your macOS menu bar. It's a native Swift app that manages a Python daemon under the hood. Everything is controlled from the menu bar icon — no terminal required.
What You See
The menu bar icon (a chip symbol) shows your connection status at a glance. Click it to access:
- Status indicator — Green when connected and serving jobs, grey when offline.
- Start / Stop Daemon — One click to go online or offline.
- Models submenu — Browse the full model catalog, enable/disable models, download new ones, and uninstall models you no longer need. Models that exceed your available RAM are automatically greyed out.
- Rename Node — Set a friendly name for your provider that appears on the network dashboard.
- Wallet address — Your Ethereum address for receiving xRAM rewards.
- Dashboard link — Opens the web dashboard showing your earnings, job history, and reputation score.
Under the Hood
When you click Start Daemon, the app launches a Python process that:
- Loads all enabled models into memory using MLX.
- Connects to the coordinator via a persistent WebSocket.
- Registers its hardware specs, loaded models, and encryption public key.
- Sends heartbeats every 10 seconds to stay in the provider registry.
- Accepts inference jobs, runs them through MLX, and returns results.
All configuration is stored in ~/.ram-aggregator/config.json. Model weights are cached in the standard HuggingFace cache directory.
Provider Controls & Settings
RAM Allocation
The RAM Allocation slider lets you control exactly how much memory RAM Aggregator can use for models. This is useful if you want to keep some RAM free for other applications while still contributing to the network.
When you lower the RAM limit, models that exceed the new cap are automatically greyed out in the model selector. The daemon reports the capped memory to the coordinator, which takes it into account when routing jobs. For example, if your Mac has 64 GB but you set the limit to 32 GB, you can still serve 7B–32B models comfortably without impacting your other work.
Prevent Sleep
macOS puts your Mac to sleep after a period of inactivity, which disconnects the provider from the network. The Prevent Sleep toggle keeps your Mac awake so it can serve inference jobs around the clock. This is ideal for dedicated provider setups (e.g., a Mac Mini or Mac Studio running headless).
When enabled, the app uses macOS power assertions to prevent system sleep. Display sleep still occurs normally — only system sleep is prevented. Disable the toggle to restore your normal sleep settings.
E2E Encryption Key
On first launch, the daemon generates an X25519 key pair for end-to-end encryption. The private key is stored in ~/.ram-aggregator/encryption_key.bin with restrictive file permissions (0600). The public key is included in every registration and heartbeat message, allowing clients to encrypt prompts specifically for your provider.
Auto-Update System
RAM Aggregator includes an over-the-air update mechanism that keeps providers running the latest version without manual intervention.
How It Works
- When a new version is available, the coordinator sends a `FORCE_UPDATE` message to all connected providers via WebSocket.
- The daemon writes a flag file with the target version and executes the update script.
- The update script downloads the latest daemon code, updates dependencies, and restarts the process.
- The provider reconnects to the coordinator with the new version.
Auto-update can be disabled in ~/.ram-aggregator/config.json by setting "auto_update": false. When disabled, the provider will log a warning about the available update but won't apply it automatically.
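For reference, a ~/.ram-aggregator/config.json with auto-update disabled might look like the sketch below. Only the `auto_update` key is documented above; the other field names are illustrative guesses at what a provider config could contain.

```json
{
  "node_name": "my-mac-studio",
  "wallet_address": "0xYourWalletAddress",
  "ram_limit_gb": 32,
  "auto_update": false
}
```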
xRAM Token Overview
| Attribute | Value |
|---|---|
| Token Name | RAM Aggregator (xRAM) |
| Standard | ERC-20 |
| Chain | Base Mainnet |
| Contract | 0x3BeB23287f24Db91249D8D90aD61a0e07F4F4C5c |
| Total Supply | 100,000,000 xRAM |
| Decimals | 18 |
| Trade | Aerodrome (ETH/xRAM) |
Allocation
- 70,000,000 xRAM (70%) — Provider treasury. Distributed to providers as inference rewards via the emission schedule.
- 30,000,000 xRAM (30%) — Team treasury. For development, partnerships, and ecosystem growth.
Emission Schedule
xRAM uses a Bitcoin-inspired halving model. The 70M provider treasury is divided into 4 epochs of 17.5M tokens each. Each epoch halves the base reward rate.
| Epoch | Tokens Available | Reward Multiplier | Daily Cap |
|---|---|---|---|
| 1 (0 – 17.5M minted) | 17,500,000 | 1.0x | 5,000,000/day |
| 2 (17.5M – 35M) | 17,500,000 | 0.5x | 2,500,000/day |
| 3 (35M – 52.5M) | 17,500,000 | 0.25x | 1,250,000/day |
| 4 (52.5M – 70M) | 17,500,000 | 0.125x | 625,000/day |
Additionally, hourly emission caps prevent flash-draining. The reward per job is calculated as: base_reward * epoch_multiplier * staking_bonus, capped by both daily and hourly limits.
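The reward formula and epoch lookup can be sketched directly from the schedule above. The epoch boundaries and multipliers come from the table; the daily and hourly caps are enforced separately and are omitted here.

```python
# Epoch boundaries (cumulative tokens minted) and reward multipliers
# from the emission schedule table.
EPOCHS = [
    (17_500_000, 1.0),
    (35_000_000, 0.5),
    (52_500_000, 0.25),
    (70_000_000, 0.125),
]


def epoch_multiplier(total_minted: float) -> float:
    for boundary, multiplier in EPOCHS:
        if total_minted < boundary:
            return multiplier
    return 0.0  # provider treasury exhausted


def job_reward(base_reward: float, total_minted: float,
               staking_bonus: float = 1.0) -> float:
    # base_reward * epoch_multiplier * staking_bonus, per the formula above.
    return base_reward * epoch_multiplier(total_minted) * staking_bonus
```

For example, a job in epoch 3 (40M minted) with a 2x staking bonus pays half of what the same base reward would pay in epoch 1 unstaked.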
Dynamic Pricing
The cost of inference adjusts automatically based on network utilization, similar to Ethereum gas fees.
| Factor | Effect |
|---|---|
| Network utilization | Higher utilization → higher prices |
| Model size | 70B models cost ~10x more than 1B models |
| Tokens generated | Cost scales linearly with output length |
The pricing engine targets 60% network utilization. Prices have a floor of 0.1 xRAM/1K tokens and a ceiling of 50 xRAM/1K tokens. Current pricing can be checked via GET /api/v1/pricing.
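The floor/ceiling clamp can be sketched as follows. Only the 60% target, the 0.1 floor, and the 50 ceiling come from the text above; the linear response to utilization is an assumption for illustration.

```python
def price_per_1k(base_rate: float, utilization: float,
                 target: float = 0.60,
                 floor: float = 0.1, ceiling: float = 50.0) -> float:
    """Scale price with utilization around the target, then clamp
    to the published floor/ceiling (xRAM per 1K tokens)."""
    multiplier = 1.0 + (utilization - target)  # assumed linear response
    return min(ceiling, max(floor, base_rate * multiplier))
```

At exactly the target utilization the base rate passes through unchanged; extreme utilization in either direction hits the clamp rather than producing runaway prices.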
Staking
Providers can stake xRAM tokens to increase their reward multiplier and get priority in job routing.
- Minimum stake: 1,000 xRAM
- Staking bonus: Up to 2x reward multiplier based on amount staked
- Unstaking lockup: 7-day cooldown period before tokens are returned
- Grace period: New networks have a 72-hour grace period where staking is not required to earn rewards
API Authentication
All API requests use the Authorization header with a Bearer token.
Key Types
| Type | Format | Use Case |
|---|---|---|
| Demo | xram_free_test | Free testing (rate limited, shared) |
| Session | sess_... | MetaMask deposit sessions (auto-issued) |
| Live | xram_live_... | Production keys (admin-created) |
| Agent | xram_agent_... | AI agent keys (admin-created) |
Session tokens are issued automatically when you deposit xRAM through the chat app. For programmatic access, use your session token as the API key.
Chat Completions
OpenAI-compatible chat completions endpoint. Works with any OpenAI SDK.
Request Body
```json
{
  "model": "llama-3.2-3b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```
Response
```json
{
  "id": "chatcmpl-a1b2c3...",
  "object": "chat.completion",
  "model": "llama-3.2-3b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 24, "completion_tokens": 8, "total_tokens": 32},
  "xram_tokens_charged": 0.032,
  "xram_session_remaining": 9967.5
}
```
Model Aliases
Use the short alias or the full HuggingFace model ID in API requests. Both work.
| Alias | Full Model ID |
|---|---|
| llama-3.2-1b | mlx-community/Llama-3.2-1B-Instruct-4bit |
| llama-3.2-3b | mlx-community/Llama-3.2-3B-Instruct-4bit |
| llama-3.1-8b | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit |
| mistral-7b | mlx-community/Mistral-7B-Instruct-v0.3-4bit |
| qwen-2.5-7b | mlx-community/Qwen2.5-7B-Instruct-4bit |
| qwen3-8b | mlx-community/Qwen3-8B-4bit |
| qwen3-30b | mlx-community/Qwen3-30B-A3B-4bit |
| qwen3-coder-30b | mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit |
| qwen-2.5-coder-32b | mlx-community/Qwen2.5-Coder-32B-Instruct-4bit |
| deepseek-r1-distill-32b | mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit |
| llama-3.3-70b | mlx-community/Llama-3.3-70B-Instruct-4bit |
| llama-3.1-70b | mlx-community/Meta-Llama-3.1-70B-Instruct-4bit |
| qwen-2.5-72b | mlx-community/Qwen2.5-72B-Instruct-4bit |
| mistral-large-2 | mlx-community/Mistral-Large-Instruct-2407-4bit |
| qwen3-235b | mlx-community/Qwen3-235B-A22B-4bit |
| qwen3.5-397b | mlx-community/Qwen3.5-397B-A17B-nvfp4 |
| llama-3.1-405b | mlx-community/Meta-Llama-3.1-405B-Instruct-4bit |
| qwen3-coder-480b | mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit |
| deepseek-r1 | mlx-community/DeepSeek-R1-4bit |
| deepseek-v3 | mlx-community/DeepSeek-V3-4bit |
Models
List available models on the network. Returns models that at least one online provider has loaded, or the full catalog if no providers are online.
Enriched model listing with provider counts, real-time pricing, and performance metrics for each model.
Deposits & Sessions
After transferring xRAM to the escrow address via MetaMask, call this endpoint with the transaction hash to verify the deposit on-chain and receive a session token.
```json
{
  "tx_hash": "0xabc123...",
  "wallet_address": "0xYourWalletAddress..."
}
```
Check your remaining deposit balance. Requires session token in Authorization header.
Close your session and withdraw your remaining balance. The escrow contract is updated so you can call transferFrom on-chain to reclaim your tokens.
Get the escrow wallet address and contract details needed for MetaMask deposits.
E2E Encryption Endpoints
These endpoints implement the two-phase encrypted inference protocol. Use them when you want end-to-end encryption so the coordinator cannot see your prompts or responses.
List online providers with their X25519 public encryption keys. Public endpoint — no authentication required. Use this to inspect which providers support E2E before initiating a request.
Phase 1: Send job metadata (model, max_tokens, temperature) without any prompt data. The coordinator assigns a provider and returns their public key plus a job ID.
```json
// Request
{ "model": "qwen3-8b", "max_tokens": 256, "temperature": 0.7 }

// Response
{
  "job_id": "abc123...",
  "provider_id": "p_xyz",
  "provider_name": "Steve's MacBook",
  "provider_public_key_b64": "base64-encoded X25519 public key"
}
```
Submit the encrypted prompt for a previously initialized job. The coordinator forwards the encrypted blob to the assigned provider without decryption.
```json
// Request
{
  "job_id": "abc123...",
  "encrypted_prompt_b64": "base64-encoded AES-256-GCM ciphertext",
  "prompt_nonce_b64": "base64-encoded 96-bit nonce",
  "client_public_key_b64": "base64-encoded ephemeral X25519 public key"
}

// Response (encrypted)
{
  "encrypted": true,
  "encrypted_output_b64": "base64-encoded ciphertext",
  "output_nonce_b64": "base64-encoded nonce",
  "xram_job_id": "abc123...",
  "xram_tokens_charged": 0.15
}
```
Providers
List online providers with hardware specs, loaded models, and reputation scores. Financial details (earnings, wallet addresses) require authentication.
Network-wide statistics: total providers, online count, available memory, and tokens minted.
Current dynamic pricing state: utilization, price multiplier, and base rates.
Get a price quote for a specific model and token count. Parameters: model (e.g., "7B"), max_tokens (e.g., 256).
Privacy Overview
RAM Aggregator is a decentralized network. Understanding who can see what is essential to using it with confidence. This section explains exactly what data is collected, who can access it, and how our encryption features protect you.
The Three Parties
Every inference request involves three parties. Each has different levels of data access:
- You (the User) — You control your prompts, your wallet, and whether to enable end-to-end encryption.
- The Coordinator — The central routing server. It matches your request to a provider and manages billing. Think of it like a postal service: it needs to know the destination, but doesn't need to read the letter.
- The Provider — The Mac running your inference. The provider must see your prompt to generate a response — this is fundamental to how AI inference works. You cannot ask a model to answer a question without the model seeing the question.
Transport Security
All connections use TLS 1.3 (HTTPS / WSS). This protects against eavesdroppers on the network (ISPs, Wi-Fi snoopers, etc.). TLS encrypts data in transit between you and the coordinator, and between the coordinator and providers. However, TLS alone does not prevent the coordinator from reading data that passes through it — that's what E2E encryption addresses.
What We Don't Collect
- No accounts or personal information. You connect with a wallet address — no email, no name, no phone number.
- No prompt logging (with E2E). When E2E encryption is enabled, the coordinator stores `<encrypted>` in place of your actual prompt and response text.
- No tracking or analytics cookies. The chat app does not use cookies, third-party analytics, or fingerprinting.
- No model training on your data. Your prompts are never used to train, fine-tune, or improve any models.
End-to-End Encryption
RAM Aggregator supports optional end-to-end (E2E) encryption that prevents the coordinator from reading your prompts and responses. When enabled, only you and the assigned provider can see the content of your conversation.
How It Works
E2E encryption uses a two-phase protocol built on industry-standard cryptographic primitives:
- Key Exchange (Phase 1) — Your browser generates an ephemeral X25519 key pair. The coordinator assigns a provider and returns that provider's public key. Neither the coordinator nor anyone else learns your private key.
- Encrypted Inference — Your browser uses Elliptic Curve Diffie-Hellman (ECDH) to derive a shared secret with the provider. Your prompt is encrypted with AES-256-GCM before leaving your browser. The coordinator receives only an opaque encrypted blob and forwards it untouched to the provider. The provider decrypts the prompt, runs inference, encrypts the response with the same shared key, and sends it back. The coordinator never sees the plaintext.
Cryptographic Details
| Property | Detail |
|---|---|
| Key Agreement | X25519 ECDH (Curve25519) |
| Key Derivation | HKDF-SHA256 with info string xram-e2e-v1 |
| Symmetric Encryption | AES-256-GCM (authenticated encryption) |
| Client Implementation | Web Crypto API (SubtleCrypto) — zero external dependencies |
| Provider Implementation | Python cryptography library (OpenSSL-backed) |
| Key Lifetime | Ephemeral — new key pair generated per request |
| Nonces | Random 96-bit, unique per encryption operation |
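The provider-side handshake can be sketched with the `cryptography` package, which the provider implementation is based on. This is an illustration of the primitives listed above, with both sides simulated in one process; the helper name `derive_shared_key` is illustrative.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def derive_shared_key(own_private: X25519PrivateKey, peer_public) -> bytes:
    """X25519 ECDH, then HKDF-SHA256 with the protocol's info string."""
    shared_secret = own_private.exchange(peer_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"xram-e2e-v1").derive(shared_secret)


# Ephemeral client key pair (new per request) and the provider's key pair.
client_private = X25519PrivateKey.generate()
provider_private = X25519PrivateKey.generate()

# Both sides derive the same symmetric key from the ECDH exchange.
client_key = derive_shared_key(client_private, provider_private.public_key())
provider_key = derive_shared_key(provider_private, client_private.public_key())

# Client encrypts the prompt; the coordinator relays only the ciphertext.
nonce = os.urandom(12)  # random 96-bit nonce, unique per operation
ciphertext = AESGCM(client_key).encrypt(nonce, b"What is the capital of France?", None)
plaintext = AESGCM(provider_key).decrypt(nonce, ciphertext, None)
```

In the browser the same derivation runs via SubtleCrypto; anything without the shared key, including the coordinator, sees only the nonce and the AES-GCM ciphertext.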
How to Enable E2E Encryption
In the chat app, click the lock icon next to the temperature selector. When the toggle is green and shows "E2E Encrypted", all subsequent messages will be encrypted. You'll see a green "E2E" badge on each encrypted message.
What E2E Encryption Protects Against
- Coordinator reading your data — The coordinator only sees encrypted blobs. It cannot read your prompts or responses.
- Server-side data breaches — If the coordinator's database is compromised, encrypted jobs contain only ciphertext, not plaintext.
- Network intermediaries — Combined with TLS, your data is protected at both the transport and application layers.
What E2E Encryption Does NOT Protect Against
We believe in being completely honest about limitations:
- The assigned provider sees your plaintext. This is unavoidable — the provider must decrypt your prompt to run inference on it, just as a translator must read a document to translate it. This is a fundamental property of computation, not a design flaw.
- Metadata is still visible to the coordinator. The coordinator can see: which model you requested, the approximate size of your prompt and response (from encrypted payload length), timing information, and your wallet address. It cannot see the actual content.
Data Visibility Matrix
This table shows exactly who can see what, depending on whether E2E encryption is enabled. We publish this so you can make informed decisions about when to use encryption.
With E2E Encryption OFF (Default)
| Data | You | Coordinator | Provider | Other Users |
|---|---|---|---|---|
| Your prompt text | Yes | Yes | Yes | No |
| AI response text | Yes | Yes | Yes | No |
| Model used | Yes | Yes | Yes | No |
| Token count | Yes | Yes | Yes | No |
| Your wallet address | Yes | Yes | No | No |
| Your IP address | Yes | Yes | No | No |
| Provider identity | Yes | Yes | Yes | No |
With E2E Encryption ON
| Data | You | Coordinator | Provider | Other Users |
|---|---|---|---|---|
| Your prompt text | Yes | No 🔒 | Yes | No |
| AI response text | Yes | No 🔒 | Yes | No |
| Model used | Yes | Yes | Yes | No |
| Token count (approx.) | Yes | Yes* | Yes | No |
| Your wallet address | Yes | Yes | No | No |
| Your IP address | Yes | Yes | No | No |
| Encrypted payload size | Yes | Yes | Yes | No |
* The coordinator can infer approximate token counts from encrypted payload sizes but cannot see the actual content. In both tables, "No" means the party cannot access that data; "Yes" means it can.
Provider Trust Model
In any decentralized inference network, you are trusting the provider to honestly execute your inference. This is similar to how cloud computing works: when you use AWS, Azure, or any cloud API, the server running your code can see your data. The difference with RAM Aggregator is that:
- Providers are pseudonymous — They are identified by wallet address and node name, not personal identity.
- Providers don't store your data — Prompts and responses exist only in memory during inference and are discarded immediately after.
- Providers cannot correlate sessions — With E2E encryption, each request uses a new ephemeral key pair, so the provider cannot link requests to the same user across sessions.
- The coordinator does not share your wallet address with providers — Providers see the job content but not who sent it.
Anti-Gaming & Security
RAM Aggregator includes multiple layers of protection to prevent reward manipulation.
Validation Checks
Every inference result passes through six validation checks before rewards are paid:
- Response quality — Rejects empty or trivially short outputs.
- Token count plausibility — Detects inflated token_generated claims by comparing to actual output length.
- Timing validation — Flags responses that are faster than physically possible for the model size on Apple Silicon.
- Self-dealing detection — Blocks rewards when the provider and client wallets are the same.
- Duplicate output detection — Identifies repeated identical responses (copy-paste farming).
- Entropy check — Rejects very low-entropy outputs (repetitive junk like "the the the...").
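As one illustration, the entropy check can be approximated with a Shannon-entropy threshold over the output's character distribution. The function and the 2.5-bit cutoff below are our own sketch, not the coordinator's exact implementation.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    # Bits per character over the output's character distribution
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def passes_entropy_check(output: str, min_bits: float = 2.5) -> bool:
    # Hypothetical cutoff: repetitive junk reuses few distinct characters,
    # so its per-character entropy is low
    return len(output) > 0 and shannon_entropy(output) >= min_bits

passes_entropy_check("the the the the the the the")     # repetitive junk: rejected
passes_entropy_check("Paris is the capital of France.") # normal prose: accepted
```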
Proof of Inference
The coordinator periodically sends challenge prompts to random online providers with known expected answers. Providers that return incorrect results receive reputation penalties, reducing their priority for future jobs.
Validation Under Encryption
When E2E encryption is enabled, the coordinator cannot read prompt or response content. Some validation checks are adjusted accordingly:
| Check | Plaintext | Encrypted | Notes |
|---|---|---|---|
| Response quality | Active | Skipped | Can't read encrypted output |
| Token count plausibility | Active | Active | Metadata, not encrypted |
| Timing validation | Active | Active | Metadata, not encrypted |
| Self-dealing detection | Active | Active | Wallet comparison |
| Duplicate output detection | Active | Skipped | Ciphertexts are unique per request (random nonces), so duplicates can't be matched |
| Entropy check | Active | Skipped | Can't read encrypted output |
| Replay protection | Active | Active | Nonce-based, unchanged |
Content-based checks are unavailable under encryption, but timing, metadata, and identity checks remain fully active. Provider reputation scoring over time compensates for the reduced validation surface.
Additional Protections
- Replay protection: Job result nonces are tracked with a 10-minute expiry window.
- Rate limiting: Per-key and per-wallet rate limits prevent abuse.
- Wallet validation: Provider registration requires a valid Ethereum address format.
- Admin-only key creation: API keys can only be created with the admin secret.
- Provider deduplication: When a provider reconnects with a new WebSocket, stale registrations with the same wallet and name are automatically removed. This prevents ghost providers from inflating the network count.
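The nonce tracking behind replay protection can be sketched as a small in-memory guard: record each nonce on first sight, reject repeats, and evict entries once the expiry window passes. Class and parameter names here are hypothetical.

```python
import time

class ReplayGuard:
    """Track job-result nonces with an expiry window (10 minutes by default)."""

    def __init__(self, ttl_seconds=600.0):
        self.ttl = ttl_seconds
        self._seen = {}  # nonce -> timestamp first seen

    def check_and_record(self, nonce, now=None):
        """Return True if the nonce is fresh, False if it is a replay."""
        if now is None:
            now = time.time()
        # Evict nonces older than the expiry window
        self._seen = {n: t for n, t in self._seen.items() if now - t < self.ttl}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        return True
```

Note that a nonce becomes acceptable again after the window expires, which is why result nonces must also be unpredictable, not merely unique.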
Smart Contract
The xRAM token is a standard ERC-20 contract deployed on Base mainnet with additional functions for the emission schedule and staking.
| Contract Address | 0x3BeB23287f24Db91249D8D90aD61a0e07F4F4C5c |
| Network | Base Mainnet (Chain ID: 8453) |
| View on BaseScan | Explorer Link |
Key Functions
- balanceOf(address) — Standard ERC-20 balance check
- transfer(to, amount) — Standard ERC-20 transfer
- approve(spender, amount) — Approve spending (used for escrow)
- transferFrom(from, to, amount) — Transfer on behalf (used for withdrawals)
Escrow Flow
- User calls transfer(escrow_address, amount) to deposit.
- Coordinator verifies the on-chain transaction and issues a session token.
- Escrow calls approve(user_address, remaining) to allow withdrawal.
- User calls transferFrom(escrow, self, remaining) to reclaim tokens.
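The flow can be traced with a toy in-memory model of the four ERC-20 calls. This is illustrative only; the real calls execute on-chain on Base, and the amounts are invented.

```python
class ToyLedger:
    """In-memory stand-in for the ERC-20 calls used by the escrow flow."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.allowances = {}  # (owner, spender) -> approved amount

    def transfer(self, sender, to, amount):
        assert self.balances.get(sender, 0) >= amount, "insufficient balance"
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount

    def approve(self, owner, spender, amount):
        self.allowances[(owner, spender)] = amount

    def transfer_from(self, spender, owner, to, amount):
        # transferFrom: spender moves owner's tokens, within the approval
        assert self.allowances.get((owner, spender), 0) >= amount, "not approved"
        self.allowances[(owner, spender)] -= amount
        self.transfer(owner, to, amount)

ledger = ToyLedger({"user": 100})
ledger.transfer("user", "escrow", 100)              # 1. deposit 100 xRAM
# 2. coordinator verifies the deposit and issues a session token (off-chain)
ledger.approve("escrow", "user", 40)                # 3. escrow approves remainder
ledger.transfer_from("user", "escrow", "user", 40)  # 4. user reclaims 40 xRAM
```

After the walk-through the user holds the 40 unspent tokens and the escrow keeps the 60 that were consumed by inference jobs.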
FAQ
What hardware do I need to be a provider?
Any Mac with Apple Silicon (M1 or later). The more RAM you have, the larger the models you can serve. An M1 with 8GB can run the 1B and 3B models; an M3 Max with 64GB can run everything up to and including 70B models. For truly massive models (405B, 671B), RAM Aggregator uses pipeline parallelism to split the model across multiple Macs — so even a few 32GB machines working together can serve a 405B model.
Is my data private?
RAM Aggregator offers end-to-end encryption that prevents the coordinator (our infrastructure) from reading your prompts or responses. When E2E is enabled, your data is encrypted in your browser before it ever leaves your device and can only be decrypted by the assigned provider. The provider must see your prompt to run inference — this is inherent to how AI models work. However, providers don't log or store your data, and they cannot identify you (your wallet address is not shared with them). See the Privacy & Encryption section for a full data visibility breakdown.
What does E2E encryption protect?
E2E encryption prevents the coordinator (our central server) from seeing your prompt and response content. It uses X25519 key exchange and AES-256-GCM encryption. The assigned provider can still see your data because it must run inference on it — but it cannot identify who you are, and it does not store your data. Think of it like end-to-end encryption in messaging apps: the server that routes messages can't read them, but the recipient (the provider running your model) can.
Does the provider store my data?
No. Providers process your prompt in memory, generate a response, and discard both immediately. There is no logging, no persistent storage, and no data retention on provider nodes. Providers also cannot see your wallet address or correlate your requests across sessions when E2E is enabled.
How much can I earn as a provider?
Earnings depend on the current emission epoch, how many jobs you complete, and your staking level. Early providers in Epoch 1 earn the most. Check the emission schedule section for detailed rates.
Can I run this on Linux or Windows?
Currently, the provider app only supports macOS with Apple Silicon due to the MLX framework requirement. The user chat app works in any browser.
Is xRAM a real cryptocurrency?
xRAM is a real ERC-20 token on Base mainnet (Coinbase's L2). It has a fixed supply of 100M tokens with a Bitcoin-style halving emission schedule.
What happens if a provider goes offline mid-job?
The coordinator detects offline providers via heartbeat monitoring. If a provider disconnects during a job, the job is automatically re-queued and assigned to another available provider. When a provider reconnects, stale registrations are cleaned up so the network always shows accurate provider counts.
How does pipeline parallelism work?
When a model is too large for any single Mac, the coordinator's shard scheduler splits it across multiple providers. Each provider loads a slice of the model's transformer layers and listens on a TCP port. During inference, hidden states flow through the pipeline: Shard 0 processes the first layers and passes its output to Shard 1, which processes the next set and passes to Shard 2, and so on. The final shard produces the output logits. This happens automatically — users just request a model and get a response.
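In miniature, the hidden-state relay reads like a fold over shard functions. This is a toy sketch of the dataflow only; real shards hold MLX transformer layers and exchange tensors over TCP.

```python
def run_pipeline(hidden_state, shards):
    # Each shard applies its slice of transformer layers and forwards the
    # hidden state to the next shard; the last shard yields the final output.
    for shard in shards:
        hidden_state = shard(hidden_state)
    return hidden_state

# Toy shards: each "layer slice" just adds an offset so the flow is visible
shards = [lambda h, i=i: h + i for i in range(3)]  # offsets 0, 1, 2
run_pipeline(10, shards)  # 10 -> 10 -> 11 -> 13
```

The coordinator's shard scheduler is what assigns each provider its slice and wires the shards together; from the user's side the pipeline behaves like one model.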
Why does my first request take longer?
If the requested model isn't already loaded in a provider's memory, it needs to be loaded from disk (or downloaded first). This can take 10–60 seconds depending on model size. Subsequent requests to the same model are much faster since the model stays in memory. The coordinator routes to providers that already have the model loaded whenever possible.
Can I limit how much of my Mac's resources are used?
Yes. The menubar app includes a RAM Allocation slider that lets you cap how much memory RAM Aggregator can use. Models that exceed your cap are automatically disabled. You can also enable Prevent Sleep to keep your Mac serving jobs while you're away, or disable it to let your Mac sleep normally when idle.
Does the app update automatically?
Yes, if auto-update is enabled (the default). The coordinator can push updates to all connected providers. When an update arrives, your daemon downloads the latest version, applies it, and restarts. You can disable auto-update in your config file if you prefer manual control.