RAM Aggregator
The decentralized AI inference network powered by Apple Silicon.
RAM Aggregator is a peer-to-peer network that connects people who want to use AI (users) with people who have spare compute (providers). Users pay with xRAM tokens to run LLM inference; providers earn xRAM by sharing their Mac's idle processing power.
The entire system runs on three pillars:
- Coordinator — The central routing layer that matches jobs to providers, manages the token ledger, and exposes an OpenAI-compatible API. Includes a shard scheduler for automatically splitting large models across multiple Macs.
- Provider Nodes — macOS menubar apps that connect to the coordinator via WebSocket, download models from HuggingFace, and run inference locally using Apple's MLX framework. Providers can serve full models or individual shards as part of a distributed pipeline.
- xRAM Token — An ERC-20 token on Base (Coinbase's L2) that powers all payments and rewards. 100M fixed supply with a Bitcoin-style halving emission schedule.
Highlights
- Pipeline parallelism — Models too large for a single Mac are automatically split across multiple providers and run as a distributed pipeline.
- Smart model routing — The coordinator intelligently selects providers that already have models loaded, falling back to lazy-loading when necessary.
- Native chat templates — Each model receives its native chat format (Llama, Qwen, Mistral) for accurate, clean responses with no prompt leakage.
- Non-blocking inference — MLX inference runs in a dedicated thread, keeping WebSocket connections alive during long generation runs (40s+).
- Provider controls — RAM allocation slider and Prevent Sleep toggle let providers manage exactly how much of their machine to share.
- Auto-update system — Providers receive over-the-air updates pushed from the coordinator, keeping the fleet in sync.
Quickstart: Users
Start chatting with AI in under a minute.
Option 1: Free Demo
Visit the chat app and start typing. The app uses a shared demo key (xram_free_test) that gives you a handful of free messages per day to try the network.
Option 2: Connect MetaMask (Unlimited)
- Open the chat app and click Connect Wallet.
- Switch MetaMask to the Base network (the app will prompt you).
- Click Deposit xRAM and choose an amount. Your xRAM tokens are transferred to the escrow smart contract.
- A session token is issued automatically. You can now chat with any model, paying per token from your deposit.
- When you're done, click Withdraw to reclaim your unused balance back to your wallet.
Option 3: OpenAI SDK (for Developers)
Point any OpenAI-compatible client at the RAM Aggregator API:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ramaggregator.com/v1",
    api_key="your_session_token_here",  # sess_... from deposit
)

response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Quickstart: Providers
Turn your Mac into an AI inference node and start earning xRAM tokens.
Requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- At least 8GB RAM (more RAM = more models you can serve)
- Stable internet connection
- An Ethereum wallet address (for receiving xRAM rewards)
Installation
- Download the RAM Aggregator macOS app (DMG installer).
- Drag it to Applications and launch. A chip icon appears in your menu bar.
- The app creates a Python virtual environment and installs dependencies automatically.
Configuration
- Click the menubar icon and select Rename Node to set a unique name for your provider.
- Open the Models submenu and click Enable on any models you want to serve.
- Models that exceed your system RAM are automatically greyed out and non-selectable.
- The model will download from HuggingFace (one-time) and then show as [ready].
- Use the RAM Allocation slider to control how much of your system memory is available for models (e.g., cap at 16GB if you want to keep resources free for other work).
- Enable Prevent Sleep if you want your Mac to stay awake and serve jobs while unattended.
Running
Click Start Daemon from the menubar. Your node connects to the coordinator, reports its available models, and begins accepting inference jobs. The status indicator turns green when you're online.
You can enable multiple models simultaneously (RAM permitting). The coordinator routes jobs to you based on which models you have loaded, your available memory, and your reputation score.
If the coordinator requests it, your node can also serve as a shard worker for pipeline parallelism — loading a portion of a large model and processing hidden states as part of a distributed pipeline. This happens automatically; no extra configuration needed.
Earning
Every time you complete an inference job, xRAM tokens are minted on-chain to your wallet address. The amount depends on the current emission epoch, the number of tokens generated, and your staking multiplier. You can check your earnings from the menubar's dashboard link.
Connection Resilience
The provider daemon automatically reconnects if the WebSocket connection drops. On reconnection, stale registrations are cleaned up so the coordinator always shows an accurate count of active providers. Inference runs in a dedicated thread so that long generation jobs (40+ seconds for large models) don't block WebSocket keepalive pings.
Uninstalling Models
To remove a model: open the Models submenu, expand the model, and click Uninstall (delete files). This removes the model from disk, HuggingFace cache, and memory.
System Architecture
RAM Aggregator is a hub-and-spoke architecture with a central coordinator and distributed provider nodes. It supports two inference modes: single-provider (for models that fit in one Mac's memory) and pipeline-parallel (for models that need to be split across multiple Macs).
```
[Users / AI Agents]
        |
        | HTTPS (OpenAI-compatible API)
        v
[Coordinator + Shard Scheduler] <---> [Base L2 Smart Contract]
        |                                  (xRAM ERC-20)
        | WebSocket
        |
   ┌────┴─────────────────────────────────────┐
   |                                          |
   |  Single Provider                         |  Pipeline Parallelism
   |  (fits in 1 Mac)                         |  (split model)
   v                                          v
[Provider 1]                 [Shard 0] ──TCP──> [Shard 1] ──TCP──> [Shard 2]
 M3 Max, 64GB                 Mac #1             Mac #2             Mac #3
 Qwen3 8B (full)              Layers 0-10        Layers 11-21       Layers 22-32
```
Coordinator
The coordinator is a FastAPI server deployed on Fly.io. It handles:
- Job routing and dispatch — Matches requests to the best provider, preferring those with models already loaded in memory.
- Shard scheduling — Automatically splits large models across multiple providers when no single node has enough RAM.
- Token ledger and emission — Manages the xRAM reward schedule and on-chain minting.
- Provider registry — Tracks heartbeats, deduplicates reconnections, and maintains accurate provider counts.
- OpenAI-compatible gateway — Exposes `/v1/chat/completions` with native chat template formatting for each model family.
- E2E encryption relay — Forwards encrypted blobs without decryption for private inference.
Provider Nodes
Each provider runs a macOS menubar app that maintains a persistent WebSocket connection to the coordinator. Providers report their hardware specs, loaded models, and availability via heartbeats every 10 seconds.
When a job arrives, the provider runs inference using Apple's MLX framework. Inference executes in a dedicated thread (asyncio.to_thread) so the asyncio event loop stays responsive for WebSocket pings and heartbeats, even during long generation runs. Prompts are formatted using each model's native chat template (e.g., <|im_start|> for Qwen3, <|begin_of_text|> for Llama) to ensure clean, accurate output.
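The non-blocking pattern described above can be sketched with `asyncio.to_thread`. This is a minimal illustration, not the daemon's actual code: `run_inference` is a hypothetical stand-in for a blocking MLX generate call, and `keepalive` stands in for WebSocket pings.

```python
import asyncio
import time


def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for a blocking MLX generation call.
    time.sleep(0.2)
    return f"echo: {prompt}"


async def keepalive(events: list) -> None:
    # Fires while inference is still running, showing the loop isn't blocked.
    await asyncio.sleep(0.05)
    events.append("ping")


async def handle_job(prompt: str, events: list) -> str:
    result, _ = await asyncio.gather(
        asyncio.to_thread(run_inference, prompt),  # blocking work in a worker thread
        keepalive(events),
    )
    events.append("done")
    return result


events: list = []
answer = asyncio.run(handle_job("hello", events))
```

Because the blocking call runs in a worker thread, the event loop services the keepalive coroutine mid-generation, which is exactly why long jobs don't drop the WebSocket connection.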
Providers can also act as shard workers for pipeline parallelism. When the coordinator assigns a shard, the provider loads only the specified transformer layers and listens on a TCP port for hidden state data from the previous shard in the pipeline.
On-Chain Layer
The xRAM ERC-20 token is deployed on Base mainnet. The coordinator holds a signer key that can mint tokens from the treasury allocation. User deposits are handled via an escrow wallet with approve/transferFrom for withdrawals.
Job Flow
Here's what happens when a user sends a chat message:
- Request — User sends a chat message via the web UI or OpenAI API. The gateway validates their API key or session token.
- Model Resolution — The gateway resolves the model alias (e.g., `qwen3-8b`) to the full HuggingFace ID and formats the prompt using the model's native chat template.
- Routing Decision — The coordinator checks whether any single provider can serve the model. If not, it checks for an existing pipeline shard group or triggers automatic shard scheduling.
- Dispatch (Single Provider) — For models that fit in one Mac, the dispatcher selects the best provider: it prefers those with the model already loaded, then falls back to any online provider with enough RAM.
- Dispatch (Pipeline) — For models split across multiple Macs, the coordinator routes the request to the pipeline orchestrator, which sends hidden states through each shard worker in sequence.
- Inference — The provider runs inference through MLX in a dedicated thread (keeping the WebSocket alive), applies the model's native chat template, strips any internal reasoning tags, and returns the result.
- Validation — The coordinator validates the result (output quality, timing plausibility, self-dealing, duplicate detection) before paying the reward.
- Payment — If validation passes, xRAM tokens are minted on-chain to the provider's wallet. The user's session deposit is deducted based on dynamic pricing.
Distributed Inference
Some models are too large for any single Mac. A 405B-parameter model needs roughly 250 GB of RAM — more than most machines have. RAM Aggregator solves this with pipeline parallelism: the model's transformer layers are split into shards, each shard runs on a different Mac, and hidden states flow through the pipeline via TCP.
How Pipeline Parallelism Works
A transformer model is a stack of identical layers. If a model has 60 layers and three Macs are available, the coordinator splits it into three shards:
```
Mac #1 (Shard 0): Layers 0–19  + Embedding layer
Mac #2 (Shard 1): Layers 20–39
Mac #3 (Shard 2): Layers 40–59 + Output head (logits)
```

Flow for each token:

```
[Prompt] → Embed → Shard 0 → TCP → Shard 1 → TCP → Shard 2 → Logits → [Token]
```
Each shard worker loads only its assigned layers into memory. The orchestrator (running on the coordinator or the first shard) manages the autoregressive generation loop: it tokenizes the input, computes embeddings, sends hidden states through the pipeline, collects logits from the final shard, samples the next token, and repeats.
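The layer split above can be computed with a few lines. This is a sketch of the arithmetic, not the scheduler's actual code; early shards absorb any remainder when the layer count doesn't divide evenly.

```python
def split_layers(num_layers: int, num_shards: int):
    """Divide transformer layers into contiguous, near-equal shards."""
    base, extra = divmod(num_layers, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # early shards take the remainder
        shards.append(range(start, start + size))
        start += size
    return shards
```

For the 60-layer example, `split_layers(60, 3)` yields layers 0–19, 20–39, and 40–59, matching the shard layout shown above.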
Automatic Shard Scheduling
When a user requests a model that no single provider can serve, the coordinator's shard scheduler takes over:
- It queries the model's layer count and estimates RAM per shard.
- It finds online providers with enough free memory to hold at least one shard.
- It assigns shards to providers and sends `LOAD_SHARD` commands via WebSocket.
- Each provider downloads the model (if needed), loads its assigned layers, and starts a TCP shard worker.
- When all shards report `SHARD_READY`, the pipeline is marked as complete and ready for inference.
This entire process is transparent to the user. They request llama-3.1-405b and get a response — they don't need to know it was split across four Macs.
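The scheduling steps above can be sketched as a greedy capacity check. This is an illustrative heuristic under assumed field names (`online`, `available_memory`, `id`); the real scheduler also accounts for layer counts and per-shard RAM estimates.

```python
def plan_shards(model_ram_gb: float, providers: list):
    """Greedy sketch: largest-free-memory providers first, one shard each,
    until the model's total RAM requirement is covered."""
    candidates = sorted(
        (p for p in providers if p["online"]),
        key=lambda p: p["available_memory"],
        reverse=True,
    )
    plan, remaining = [], model_ram_gb
    for shard_id, p in enumerate(candidates):
        if remaining <= 0:
            break
        share = min(p["available_memory"], remaining)
        plan.append({"shard": shard_id, "provider": p["id"], "ram_gb": share})
        remaining -= share
    return plan if remaining <= 0 else None  # None: not enough network capacity
```

A `None` result corresponds to the case where the request cannot be served at all, even with pipeline parallelism.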
Shard Worker Architecture
Each shard worker is a lightweight TCP server that handles three message types:
- FORWARD_PASS — Receives serialized hidden states, runs them through its transformer layers, and returns the output. If it's the last shard, it applies the final layer norm and output projection to produce logits.
- KV_CACHE_INIT — Initializes the key-value cache for efficient autoregressive generation.
- HEARTBEAT — Confirms the shard is alive and responsive.
Hidden states are serialized as compact binary arrays (MLX arrays → numpy → bytes) with shape/dtype metadata. The protocol uses length-prefixed JSON headers for routing and raw binary payloads for tensor data.
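The length-prefixed framing can be sketched as follows. Function names are illustrative; the shape/dtype metadata rides in the JSON header exactly as described above, with the raw tensor bytes appended after it.

```python
import json
import struct

import numpy as np


def pack_message(header: dict, tensor: np.ndarray) -> bytes:
    """Length-prefixed JSON header followed by raw tensor bytes."""
    header = {**header, "shape": tensor.shape, "dtype": str(tensor.dtype)}
    header_bytes = json.dumps(header).encode()
    return struct.pack(">I", len(header_bytes)) + header_bytes + tensor.tobytes()


def unpack_message(data: bytes):
    """Inverse of pack_message: recover the header dict and the tensor."""
    (header_len,) = struct.unpack(">I", data[:4])
    header = json.loads(data[4 : 4 + header_len])
    tensor = np.frombuffer(data[4 + header_len :], dtype=header["dtype"])
    return header, tensor.reshape(header["shape"])
```

In the real pipeline the MLX array is converted to numpy before packing; the wire format itself is independent of the tensor library.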
Smart Routing
The coordinator makes intelligent decisions about where to send each inference request. Here's the priority order:
Provider Selection
- Model already loaded — The coordinator prefers providers that have the requested model loaded and ready in memory. This avoids model loading time (which can be 10–60 seconds for larger models).
- Preferred provider — If the client specifies a `preferred_provider` ID (e.g., for testing or affinity), the coordinator routes to that provider if it's online.
- Best available — Among providers with the right model, the coordinator picks the one with the highest `available_memory × reputation` score.
- Lazy-load fallback — If no provider has the model loaded, any online provider with enough RAM is selected. The model will be downloaded and loaded on first use.
- Pipeline fallback — If no single provider has enough RAM, the shard scheduler splits the model across multiple providers.
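The priority order above can be sketched in a few lines. Field names and the exact precedence between the preferred-provider override and the warm-provider check are assumptions for illustration.

```python
def select_provider(providers, model, preferred=None, required_ram=0):
    """Sketch of the routing priority: preferred override, then warm
    providers ranked by memory x reputation, then lazy-load fallback."""
    online = [p for p in providers if p["online"]]
    if preferred is not None:
        for p in online:
            if p["id"] == preferred:
                return p

    def score(p):
        return p["available_memory"] * p["reputation"]

    warm = [p for p in online if model in p["loaded_models"]]
    if warm:  # model already in memory: no load time
        return max(warm, key=score)
    cold = [p for p in online if p["available_memory"] >= required_ram]
    if cold:  # lazy-load fallback
        return max(cold, key=score)
    return None  # caller escalates to the shard scheduler
```

A `None` return corresponds to the pipeline fallback: no single node fits the model, so the shard scheduler takes over.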
Chat Template Formatting
Different model families require different prompt formats. The coordinator sends prompts in a generic System: ... / User: ... / Assistant: format, and the provider's inference engine applies the correct chat template using the tokenizer's built-in apply_chat_template function. This ensures each model sees its native token format:
| Model Family | Template Style | Special Handling |
|---|---|---|
| Llama 3.x | <|begin_of_text|> + role headers | Standard instruct format |
| Qwen 2.5 / 3 | <|im_start|> ChatML-style | Qwen3: /no_think directive to suppress chain-of-thought |
| Mistral | [INST] markers | Standard instruct format |
| DeepSeek | ChatML-style | <think> tags stripped from output |
If a model includes internal reasoning in <think>...</think> tags, the provider automatically strips these before returning the response, so users see only the final answer.
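The tag-stripping step can be sketched with a regular expression. This is a minimal illustration, assuming well-formed, non-nested `<think>` blocks as emitted by these model families.

```python
import re


def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> reasoning blocks before returning output."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
```

The non-greedy match plus `DOTALL` handles multi-line reasoning blocks; text without tags passes through unchanged.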
Model Catalog
All models are 4-bit quantized MLX versions from the mlx-community HuggingFace organization.
Edge & Small (8–16 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Llama 3.2 1B Instruct | 1B | ~1.5 GB | ~150 tok/s | Simple tasks, fast replies |
| Llama 3.2 3B Instruct | 3B | ~3 GB | ~100 tok/s | Balanced speed & quality |
| Mistral 7B Instruct v0.3 | 7B | ~5.5 GB | ~60 tok/s | Reasoning, instruction following |
| Qwen 2.5 7B Instruct | 7B | ~5.5 GB | ~60 tok/s | Multilingual, coding |
| Qwen3 8B | 8B | ~5.5 GB | ~55 tok/s | Latest gen, built-in thinking mode |
| Llama 3.1 8B Instruct | 8B | ~6 GB | ~50 tok/s | General purpose |
Medium (16–64 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen3 30B MoE NEW | 30B (3B active) | ~18 GB | ~90 tok/s | Smart like 30B, fast like 3B |
| Qwen3 Coder 30B MoE NEW | 30B (3B active) | ~18 GB | ~90 tok/s | Sonnet-class agentic coding |
| Qwen 2.5 Coder 32B Instruct | 32B | ~20 GB | ~25 tok/s | Code generation, 80+ languages |
| DeepSeek R1 Distill 32B | 32B | ~20 GB | ~25 tok/s | Chain-of-thought reasoning |
| Llama 3.3 70B Instruct | 70B | ~42 GB | ~12 tok/s | Latest Llama, best quality/size |
| Llama 3.1 70B Instruct | 70B | ~42 GB | ~10 tok/s | Proven workhorse |
| Qwen 2.5 72B Instruct | 72B | ~45 GB | ~10 tok/s | GPT-4 class multilingual |
Large (96–192 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Mistral Large 2 | 123B | ~75 GB | ~6 tok/s | Mistral flagship, complex tasks |
| Qwen3 235B MoE NEW | 235B (22B active) | ~135 GB | ~30 tok/s | Thinking mode, 119 languages |
XL (192–384 GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen3.5 397B MoE NEW | 397B (17B active) | ~225 GB | ~35 tok/s | Latest Qwen flagship, hybrid DeltaNet, 201 languages |
| Llama 3.1 405B Instruct | 405B | ~250 GB | ~3 tok/s | Largest open dense model |
| Qwen3 Coder 480B MoE NEW | 480B (35B active) | ~280 GB | ~18 tok/s | Ultimate coding, frontier-class |
Ultra (400+ GB RAM)
| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| DeepSeek R1 671B | 671B MoE | ~400 GB | ~5 tok/s | Ultimate reasoning (37B active) |
| DeepSeek V3 671B | 671B MoE | ~400 GB | ~5 tok/s | Flagship general-purpose MoE |
Provider Menubar App
The RAM Aggregator provider app lives in your macOS menu bar. It's a native Swift app that manages a Python daemon under the hood. Everything is controlled from the menu bar icon — no terminal required.
What You See
The menu bar icon (a chip symbol) shows your connection status at a glance. Click it to access:
- Status indicator — Green when connected and serving jobs, grey when offline.
- Start / Stop Daemon — One click to go online or offline.
- Models submenu — Browse the full model catalog, enable/disable models, download new ones, and uninstall models you no longer need. Models that exceed your available RAM are automatically greyed out.
- Rename Node — Set a friendly name for your provider that appears on the network dashboard.
- Wallet address — Your Ethereum address for receiving xRAM rewards.
- Dashboard link — Opens the web dashboard showing your earnings, job history, and reputation score.
Under the Hood
When you click Start Daemon, the app launches a Python process that:
- Loads all enabled models into memory using MLX.
- Connects to the coordinator via a persistent WebSocket.
- Registers its hardware specs, loaded models, and encryption public key.
- Sends heartbeats every 10 seconds to stay in the provider registry.
- Accepts inference jobs, runs them through MLX, and returns results.
All configuration is stored in ~/.ram-aggregator/config.json. Model weights are cached in the standard HuggingFace cache directory.
Provider Controls & Settings
RAM Allocation
The RAM Allocation slider lets you control exactly how much memory RAM Aggregator can use for models. This is useful if you want to keep some RAM free for other applications while still contributing to the network.
When you lower the RAM limit, models that exceed the new cap are automatically greyed out in the model selector. The daemon reports the capped memory to the coordinator, which takes it into account when routing jobs. For example, if your Mac has 64 GB but you set the limit to 32 GB, you can still serve 7B–32B models comfortably without impacting your other work.
Prevent Sleep
macOS puts your Mac to sleep after a period of inactivity, which disconnects the provider from the network. The Prevent Sleep toggle keeps your Mac awake so it can serve inference jobs around the clock. This is ideal for dedicated provider setups (e.g., a Mac Mini or Mac Studio running headless).
When enabled, the app uses macOS power assertions to prevent system sleep. Display sleep still occurs normally — only system sleep is prevented. Disable the toggle to restore your normal sleep settings.
E2E Encryption Key
On first launch, the daemon generates an X25519 key pair for end-to-end encryption. The private key is stored in ~/.ram-aggregator/encryption_key.bin with restrictive file permissions (0600). The public key is included in every registration and heartbeat message, allowing clients to encrypt prompts specifically for your provider.
Auto-Update System
RAM Aggregator includes an over-the-air update mechanism that keeps providers running the latest version without manual intervention.
How It Works
- When a new version is available, the coordinator sends a `FORCE_UPDATE` message to all connected providers via WebSocket.
- The daemon writes a flag file with the target version and executes the update script.
- The update script downloads the latest daemon code, updates dependencies, and restarts the process.
- The provider reconnects to the coordinator with the new version.
Auto-update can be disabled in ~/.ram-aggregator/config.json by setting "auto_update": false. When disabled, the provider will log a warning about the available update but won't apply it automatically.
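For reference, a ~/.ram-aggregator/config.json with auto-update disabled might look like the sketch below. Only the `auto_update` key is documented above; the other field names are illustrative guesses at what a provider config could contain.

```json
{
  "node_name": "my-mac-studio",
  "wallet_address": "0xYourWalletAddress",
  "ram_limit_gb": 32,
  "auto_update": false
}
```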
xRAM Token Overview
| Attribute | Value |
|---|---|
| Token Name | RAM Aggregator (xRAM) |
| Standard | ERC-20 |
| Chain | Base Mainnet |
| Contract | 0x3BeB23287f24Db91249D8D90aD61a0e07F4F4C5c |
| Total Supply | 100,000,000 xRAM |
| Decimals | 18 |
| Trade | Aerodrome (ETH/xRAM) |
Allocation
- 70,000,000 xRAM (70%) — Provider treasury. Distributed to providers as inference rewards via the emission schedule.
- 30,000,000 xRAM (30%) — Team treasury. For development, partnerships, and ecosystem growth.
Emission Schedule
xRAM uses a Bitcoin-inspired halving model. The 70M provider treasury is divided into 4 epochs of 17.5M tokens each. Each epoch halves the base reward rate.
| Epoch | Tokens Available | Reward Multiplier | Daily Cap |
|---|---|---|---|
| 1 (0 – 17.5M minted) | 17,500,000 | 1.0x | 5,000,000/day |
| 2 (17.5M – 35M) | 17,500,000 | 0.5x | 2,500,000/day |
| 3 (35M – 52.5M) | 17,500,000 | 0.25x | 1,250,000/day |
| 4 (52.5M – 70M) | 17,500,000 | 0.125x | 625,000/day |
Additionally, hourly emission caps prevent flash-draining. The reward per job is calculated as: base_reward * epoch_multiplier * staking_bonus, capped by both daily and hourly limits.
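The reward formula and epoch lookup can be sketched directly from the schedule above. The epoch boundaries and multipliers come from the table; the daily and hourly caps are enforced separately and are omitted here.

```python
# Epoch boundaries (cumulative tokens minted) and reward multipliers
# from the emission schedule table.
EPOCHS = [
    (17_500_000, 1.0),
    (35_000_000, 0.5),
    (52_500_000, 0.25),
    (70_000_000, 0.125),
]


def epoch_multiplier(total_minted: float) -> float:
    for boundary, multiplier in EPOCHS:
        if total_minted < boundary:
            return multiplier
    return 0.0  # provider treasury exhausted


def job_reward(base_reward: float, total_minted: float,
               staking_bonus: float = 1.0) -> float:
    # base_reward * epoch_multiplier * staking_bonus, per the formula above.
    return base_reward * epoch_multiplier(total_minted) * staking_bonus
```

For example, a job in epoch 3 (40M minted) with a 2x staking bonus pays half of what the same base reward would pay in epoch 1 unstaked.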
Dynamic Pricing
The cost of inference adjusts automatically based on network utilization, similar to Ethereum gas fees.
| Factor | Effect |
|---|---|
| Network utilization | Higher utilization → higher prices |
| Model size | 70B models cost ~10x more than 1B models |
| Tokens generated | Cost scales linearly with output length |
The pricing engine targets 60% network utilization. Prices have a floor of 0.1 xRAM/1K tokens and a ceiling of 50 xRAM/1K tokens. Current pricing can be checked via GET /api/v1/pricing.
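The floor/ceiling clamp can be sketched as follows. Only the 60% target, the 0.1 floor, and the 50 ceiling come from the text above; the linear response to utilization is an assumption for illustration.

```python
def price_per_1k(base_rate: float, utilization: float,
                 target: float = 0.60,
                 floor: float = 0.1, ceiling: float = 50.0) -> float:
    """Scale price with utilization around the target, then clamp
    to the published floor/ceiling (xRAM per 1K tokens)."""
    multiplier = 1.0 + (utilization - target)  # assumed linear response
    return min(ceiling, max(floor, base_rate * multiplier))
```

At exactly the target utilization the base rate passes through unchanged; extreme utilization in either direction hits the clamp rather than producing runaway prices.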
Staking
Providers can stake xRAM tokens to increase their reward multiplier and get priority in job routing.
- Minimum stake: 1,000 xRAM
- Staking bonus: Up to 2x reward multiplier based on amount staked
- Unstaking lockup: 7-day cooldown period before tokens are returned
- Grace period: New networks have a 72-hour grace period where staking is not required to earn rewards
API Authentication
All API requests use the Authorization header with a Bearer token.
Key Types
| Type | Format | Use Case |
|---|---|---|
| Demo | xram_free_test | Free testing (rate limited, shared) |
| Session | sess_... | MetaMask deposit sessions (auto-issued) |
| Live | xram_live_... | Production keys (admin-created) |
| Agent | xram_agent_... | AI agent keys (admin-created) |
Session tokens are issued automatically when you deposit xRAM through the chat app. For programmatic access, use your session token as the API key.
Chat Completions
OpenAI-compatible chat completions endpoint. Works with any OpenAI SDK.
Request Body
```json
{
  "model": "llama-3.2-3b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```
Response
```json
{
  "id": "chatcmpl-a1b2c3...",
  "object": "chat.completion",
  "model": "llama-3.2-3b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 24, "completion_tokens": 8, "total_tokens": 32},
  "xram_tokens_charged": 0.032,
  "xram_session_remaining": 9967.5
}
```
Model Aliases
Use the short alias or the full HuggingFace model ID in API requests. Both work.
| Alias | Full Model ID |
|---|---|
| llama-3.2-1b | mlx-community/Llama-3.2-1B-Instruct-4bit |
| llama-3.2-3b | mlx-community/Llama-3.2-3B-Instruct-4bit |
| llama-3.1-8b | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit |
| mistral-7b | mlx-community/Mistral-7B-Instruct-v0.3-4bit |
| qwen-2.5-7b | mlx-community/Qwen2.5-7B-Instruct-4bit |
| qwen3-8b | mlx-community/Qwen3-8B-4bit |
| qwen3-30b | mlx-community/Qwen3-30B-A3B-4bit |
| qwen3-coder-30b | mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit |
| qwen-2.5-coder-32b | mlx-community/Qwen2.5-Coder-32B-Instruct-4bit |
| deepseek-r1-distill-32b | mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit |
| llama-3.3-70b | mlx-community/Llama-3.3-70B-Instruct-4bit |
| llama-3.1-70b | mlx-community/Meta-Llama-3.1-70B-Instruct-4bit |
| qwen-2.5-72b | mlx-community/Qwen2.5-72B-Instruct-4bit |
| mistral-large-2 | mlx-community/Mistral-Large-Instruct-2407-4bit |
| qwen3-235b | mlx-community/Qwen3-235B-A22B-4bit |
| qwen3.5-397b | mlx-community/Qwen3.5-397B-A17B-nvfp4 |
| llama-3.1-405b | mlx-community/Meta-Llama-3.1-405B-Instruct-4bit |
| qwen3-coder-480b | mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit |
| deepseek-r1 | mlx-community/DeepSeek-R1-4bit |
| deepseek-v3 | mlx-community/DeepSeek-V3-4bit |
Models
List available models on the network. Returns models that at least one online provider has loaded, or the full catalog if no providers are online.
Enriched model listing with provider counts, real-time pricing, and performance metrics for each model.
Deposits & Sessions
After transferring xRAM to the escrow address via MetaMask, call this endpoint with the transaction hash to verify the deposit on-chain and receive a session token.
```json
{
  "tx_hash": "0xabc123...",
  "wallet_address": "0xYourWalletAddress..."
}
```
Check your remaining deposit balance. Requires session token in Authorization header.
Close your session and withdraw your remaining balance. The escrow contract is updated so you can call transferFrom on-chain to reclaim your tokens.
Get the escrow wallet address and contract details needed for MetaMask deposits.
E2E Encryption Endpoints
These endpoints implement the two-phase encrypted inference protocol. Use them when you want end-to-end encryption so the coordinator cannot see your prompts or responses.
List online providers with their X25519 public encryption keys. Public endpoint — no authentication required. Use this to inspect which providers support E2E before initiating a request.
Phase 1: Send job metadata (model, max_tokens, temperature) without any prompt data. The coordinator assigns a provider and returns their public key plus a job ID.
```json
// Request
{ "model": "qwen3-8b", "max_tokens": 256, "temperature": 0.7 }

// Response
{
  "job_id": "abc123...",
  "provider_id": "p_xyz",
  "provider_name": "Steve's MacBook",
  "provider_public_key_b64": "base64-encoded X25519 public key"
}
```
Submit the encrypted prompt for a previously initialized job. The coordinator forwards the encrypted blob to the assigned provider without decryption.
```json
// Request
{
  "job_id": "abc123...",
  "encrypted_prompt_b64": "base64-encoded AES-256-GCM ciphertext",
  "prompt_nonce_b64": "base64-encoded 96-bit nonce",
  "client_public_key_b64": "base64-encoded ephemeral X25519 public key"
}

// Response (encrypted)
{
  "encrypted": true,
  "encrypted_output_b64": "base64-encoded ciphertext",
  "output_nonce_b64": "base64-encoded nonce",
  "xram_job_id": "abc123...",
  "xram_tokens_charged": 0.15
}
```
Providers
List online providers with hardware specs, loaded models, and reputation scores. Financial details (earnings, wallet addresses) require authentication.
Network-wide statistics: total providers, online count, available memory, and tokens minted.
Current dynamic pricing state: utilization, price multiplier, and base rates.
Get a price quote for a specific model and token count. Parameters: model (e.g., "7B"), max_tokens (e.g., 256).
Privacy Overview
RAM Aggregator is a decentralized network. Understanding who can see what is essential to using it with confidence. This section explains exactly what data is collected, who can access it, and how our encryption features protect you.
The Three Parties
Every inference request involves three parties. Each has different levels of data access:
- You (the User) — You control your prompts, your wallet, and whether to enable end-to-end encryption.
- The Coordinator — The central routing server. It matches your request to a provider and manages billing. Think of it like a postal service: it needs to know the destination, but doesn't need to read the letter.
- The Provider — The Mac running your inference. The provider must see your prompt to generate a response — this is fundamental to how AI inference works. You cannot ask a model to answer a question without the model seeing the question.
Transport Security
All connections use TLS 1.3 (HTTPS / WSS). This protects against eavesdroppers on the network (ISPs, Wi-Fi snoopers, etc.). TLS encrypts data in transit between you and the coordinator, and between the coordinator and providers. However, TLS alone does not prevent the coordinator from reading data that passes through it — that's what E2E encryption addresses.
What We Don't Collect
- No accounts or personal information. You connect with a wallet address — no email, no name, no phone number.
- No prompt logging (with E2E). When E2E encryption is enabled, the coordinator stores `<encrypted>` in place of your actual prompt and response text.
- No tracking or analytics cookies. The chat app does not use cookies, third-party analytics, or fingerprinting.
- No model training on your data. Your prompts are never used to train, fine-tune, or improve any models.
End-to-End Encryption
RAM Aggregator supports optional end-to-end (E2E) encryption that prevents the coordinator from reading your prompts and responses. When enabled, only you and the assigned provider can see the content of your conversation.
How It Works
E2E encryption uses a two-phase protocol built on industry-standard cryptographic primitives:
- Key Exchange (Phase 1) — Your browser generates an ephemeral X25519 key pair. The coordinator assigns a provider and returns that provider's public key. Neither the coordinator nor anyone else learns your private key.
- Encrypted Inference — Your browser uses Elliptic Curve Diffie-Hellman (ECDH) to derive a shared secret with the provider. Your prompt is encrypted with AES-256-GCM before leaving your browser. The coordinator receives only an opaque encrypted blob and forwards it untouched to the provider. The provider decrypts the prompt, runs inference, encrypts the response with the same shared key, and sends it back. The coordinator never sees the plaintext.
Cryptographic Details
| Property | Detail |
|---|---|
| Key Agreement | X25519 ECDH (Curve25519) |
| Key Derivation | HKDF-SHA256 with info string xram-e2e-v1 |
| Symmetric Encryption | AES-256-GCM (authenticated encryption) |
| Client Implementation | Web Crypto API (SubtleCrypto) — zero external dependencies |
| Provider Implementation | Python cryptography library (OpenSSL-backed) |
| Key Lifetime | Ephemeral — new key pair generated per request |
| Nonces | Random 96-bit, unique per encryption operation |
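The provider-side handshake can be sketched with the `cryptography` package, which the provider implementation is based on. This is an illustration of the primitives listed above, with both sides simulated in one process; the helper name `derive_shared_key` is illustrative.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF


def derive_shared_key(own_private: X25519PrivateKey, peer_public) -> bytes:
    """X25519 ECDH, then HKDF-SHA256 with the protocol's info string."""
    shared_secret = own_private.exchange(peer_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"xram-e2e-v1").derive(shared_secret)


# Ephemeral client key pair (new per request) and the provider's key pair.
client_private = X25519PrivateKey.generate()
provider_private = X25519PrivateKey.generate()

# Both sides derive the same symmetric key from the ECDH exchange.
client_key = derive_shared_key(client_private, provider_private.public_key())
provider_key = derive_shared_key(provider_private, client_private.public_key())

# Client encrypts the prompt; the coordinator relays only the ciphertext.
nonce = os.urandom(12)  # random 96-bit nonce, unique per operation
ciphertext = AESGCM(client_key).encrypt(nonce, b"What is the capital of France?", None)
plaintext = AESGCM(provider_key).decrypt(nonce, ciphertext, None)
```

In the browser the same derivation runs via SubtleCrypto; anything without the shared key, including the coordinator, sees only the nonce and the AES-GCM ciphertext.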
How to Enable E2E Encryption
In the chat app, click the lock icon next to the temperature selector. When the toggle is green and shows "E2E Encrypted", all subsequent messages will be encrypted. You'll see a green "E2E" badge on each encrypted message.
What E2E Encryption Protects Against
- Coordinator reading your data — The coordinator only sees encrypted blobs. It cannot read your prompts or responses.
- Server-side data breaches — If the coordinator's database is compromised, encrypted jobs contain only ciphertext, not plaintext.
- Network intermediaries — Combined with TLS, your data is protected at both the transport and application layers.
What E2E Encryption Does NOT Protect Against
We believe in being completely honest about limitations:
- The assigned provider sees your plaintext. This is unavoidable — the provider must decrypt your prompt to run inference on it, just as a translator must read a document to translate it. This is a fundamental property of computation, not a design flaw.
- Metadata is still visible to the coordinator. The coordinator can see: which model you requested, the approximate size of your prompt and response (from encrypted payload length), timing information, and your wallet address. It cannot see the actual content.
Data Visibility Matrix
This table shows exactly who can see what, depending on whether E2E encryption is enabled. We publish this so you can make informed decisions about when to use encryption.
With E2E Encryption OFF (Default)
| Data | You | Coordinator | Provider | Other Users |
|---|---|---|---|---|
| Your prompt text | Yes | Yes | Yes | No |
| AI response text | Yes | Yes | Yes | No |
| Model used | Yes | Yes | Yes | No |
| Token count | Yes | Yes | Yes | No |
| Your wallet address | Yes | Yes | No | No |
| Your IP address | Yes | Yes | No | No |
| Provider identity | Yes | Yes | Yes | No |
With E2E Encryption ON
| Data | You | Coordinator | Provider | Other Users |
|---|---|---|---|---|
| Your prompt text | Yes | No 🔒 | Yes | No |
| AI response text | Yes | No 🔒 | Yes | No |
| Model used | Yes | Yes | Yes | No |
| Token count (approx.) | Yes | Yes* | Yes | No |
| Your wallet address | Yes | Yes | No | No |
| Your IP address | Yes | Yes | No | No |
| Encrypted payload size | Yes | Yes | Yes | No |
* The coordinator can infer approximate token counts from encrypted payload sizes but cannot see the actual content. In both tables, "No" means the party cannot access that data; "Yes" means it can.
Provider Trust Model
In any decentralized inference network, you are trusting the provider to honestly execute your inference. This is similar to how cloud computing works: when you use AWS, Azure, or any cloud API, the server running your code can see your data. The difference with RAM Aggregator is that:
- Providers are pseudonymous — They are identified by wallet address and node name, not personal identity.
- Providers don't store your data — Prompts and responses exist only in memory during inference and are discarded immediately after.
- Providers cannot correlate sessions — With E2E encryption, each request uses a new ephemeral key pair, so the provider cannot link requests to the same user across sessions.
- The coordinator does not share your wallet address with providers — Providers see the job content but not who sent it.
Anti-Gaming & Security
RAM Aggregator includes multiple layers of protection to prevent reward manipulation.
Validation Checks
Every inference result passes through six validation checks before rewards are paid:
- Response quality — Rejects empty or trivially short outputs.
- Token count plausibility — Detects inflated token_generated claims by comparing to actual output length.
- Timing validation — Flags responses that are faster than physically possible for the model size on Apple Silicon.
- Self-dealing detection — Blocks rewards when the provider and client wallets are the same.
- Duplicate output detection — Identifies repeated identical responses (copy-paste farming).
- Entropy check — Rejects very low-entropy outputs (repetitive junk like "the the the...").
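As one illustration, the entropy check can be approximated with a Shannon-entropy threshold over the output's character distribution. The function and the 2.5-bit cutoff below are our own sketch, not the coordinator's exact implementation.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    # Bits per character over the output's character distribution
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def passes_entropy_check(output: str, min_bits: float = 2.5) -> bool:
    # Hypothetical cutoff: repetitive junk reuses few distinct characters,
    # so its per-character entropy is low
    return len(output) > 0 and shannon_entropy(output) >= min_bits

passes_entropy_check("the the the the the the the")     # repetitive junk: rejected
passes_entropy_check("Paris is the capital of France.") # normal prose: accepted
```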
Proof of Inference
The coordinator periodically sends challenge prompts to random online providers with known expected answers. Providers that return incorrect results receive reputation penalties, reducing their priority for future jobs.
Validation Under Encryption
When E2E encryption is enabled, the coordinator cannot read prompt or response content. Some validation checks are adjusted accordingly:
| Check | Plaintext | Encrypted | Notes |
|---|---|---|---|
| Response quality | Active | Skipped | Can't read encrypted output |
| Token count plausibility | Active | Active | Metadata, not encrypted |
| Timing validation | Active | Active | Metadata, not encrypted |
| Self-dealing detection | Active | Active | Wallet comparison |
| Duplicate output detection | Active | Skipped | Ciphertexts are unique per request (random nonces), so duplicates can't be matched |
| Entropy check | Active | Skipped | Can't read encrypted output |
| Replay protection | Active | Active | Nonce-based, unchanged |
Content-based checks are unavailable under encryption, but timing, metadata, and identity checks remain fully active. Provider reputation scoring over time compensates for the reduced validation surface.
Additional Protections
- Replay protection: Job result nonces are tracked with a 10-minute expiry window.
- Rate limiting: Per-key and per-wallet rate limits prevent abuse.
- Wallet validation: Provider registration requires a valid Ethereum address format.
- Admin-only key creation: API keys can only be created with the admin secret.
- Provider deduplication: When a provider reconnects with a new WebSocket, stale registrations with the same wallet and name are automatically removed. This prevents ghost providers from inflating the network count.
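The nonce tracking behind replay protection can be sketched as a small in-memory guard: record each nonce on first sight, reject repeats, and evict entries once the expiry window passes. Class and parameter names here are hypothetical.

```python
import time

class ReplayGuard:
    """Track job-result nonces with an expiry window (10 minutes by default)."""

    def __init__(self, ttl_seconds=600.0):
        self.ttl = ttl_seconds
        self._seen = {}  # nonce -> timestamp first seen

    def check_and_record(self, nonce, now=None):
        """Return True if the nonce is fresh, False if it is a replay."""
        if now is None:
            now = time.time()
        # Evict nonces older than the expiry window
        self._seen = {n: t for n, t in self._seen.items() if now - t < self.ttl}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        return True
```

Note that a nonce becomes acceptable again after the window expires, which is why result nonces must also be unpredictable, not merely unique.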
Smart Contract
The xRAM token is a standard ERC-20 contract deployed on Base mainnet with additional functions for the emission schedule and staking.
| Contract Address | 0x3BeB23287f24Db91249D8D90aD61a0e07F4F4C5c |
| Network | Base Mainnet (Chain ID: 8453) |
| View on BaseScan | Explorer Link |
Key Functions
- balanceOf(address) — Standard ERC-20 balance check
- transfer(to, amount) — Standard ERC-20 transfer
- approve(spender, amount) — Approve spending (used for escrow)
- transferFrom(from, to, amount) — Transfer on behalf (used for withdrawals)
Escrow Flow
- User calls transfer(escrow_address, amount) to deposit.
- Coordinator verifies the on-chain transaction and issues a session token.
- Escrow calls approve(user_address, remaining) to allow withdrawal.
- User calls transferFrom(escrow, self, remaining) to reclaim tokens.
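The flow can be traced with a toy in-memory model of the four ERC-20 calls. This is illustrative only; the real calls execute on-chain on Base, and the amounts are invented.

```python
class ToyLedger:
    """In-memory stand-in for the ERC-20 calls used by the escrow flow."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.allowances = {}  # (owner, spender) -> approved amount

    def transfer(self, sender, to, amount):
        assert self.balances.get(sender, 0) >= amount, "insufficient balance"
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount

    def approve(self, owner, spender, amount):
        self.allowances[(owner, spender)] = amount

    def transfer_from(self, spender, owner, to, amount):
        # transferFrom: spender moves owner's tokens, within the approval
        assert self.allowances.get((owner, spender), 0) >= amount, "not approved"
        self.allowances[(owner, spender)] -= amount
        self.transfer(owner, to, amount)

ledger = ToyLedger({"user": 100})
ledger.transfer("user", "escrow", 100)              # 1. deposit 100 xRAM
# 2. coordinator verifies the deposit and issues a session token (off-chain)
ledger.approve("escrow", "user", 40)                # 3. escrow approves remainder
ledger.transfer_from("user", "escrow", "user", 40)  # 4. user reclaims 40 xRAM
```

After the walk-through the user holds the 40 unspent tokens and the escrow keeps the 60 that were consumed by inference jobs.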
FAQ
What hardware do I need to be a provider?
Any Mac with Apple Silicon (M1 or later). The more RAM you have, the larger the models you can serve. An M1 with 8GB can run the 1B and 3B models; an M3 Max with 64GB can run everything up to and including 70B models. For truly massive models (405B, 671B), RAM Aggregator uses pipeline parallelism to split the model across multiple Macs — so even a few 32GB machines working together can serve a 405B model.
Is my data private?
RAM Aggregator offers end-to-end encryption that prevents the coordinator (our infrastructure) from reading your prompts or responses. When E2E is enabled, your data is encrypted in your browser before it ever leaves your device and can only be decrypted by the assigned provider. The provider must see your prompt to run inference — this is inherent to how AI models work. However, providers don't log or store your data, and they cannot identify you (your wallet address is not shared with them). See the Privacy & Encryption section for a full data visibility breakdown.
What does E2E encryption protect?
E2E encryption prevents the coordinator (our central server) from seeing your prompt and response content. It uses X25519 key exchange and AES-256-GCM encryption. The assigned provider can still see your data because it must run inference on it — but it cannot identify who you are, and it does not store your data. Think of it like end-to-end encryption in messaging apps: the server that routes messages can't read them, but the recipient (the provider running your model) can.
Does the provider store my data?
No. Providers process your prompt in memory, generate a response, and discard both immediately. There is no logging, no persistent storage, and no data retention on provider nodes. Providers also cannot see your wallet address or correlate your requests across sessions when E2E is enabled.
How much can I earn as a provider?
Earnings depend on the current emission epoch, how many jobs you complete, and your staking level. Early providers in Epoch 1 earn the most. Check the emission schedule section for detailed rates.
Can I run this on Linux or Windows?
Currently, the provider app only supports macOS with Apple Silicon due to the MLX framework requirement. The user chat app works in any browser.
Is xRAM a real cryptocurrency?
xRAM is a real ERC-20 token on Base mainnet (Coinbase's L2). It has a fixed supply of 100M tokens with a Bitcoin-style halving emission schedule.
What happens if a provider goes offline mid-job?
The coordinator detects offline providers via heartbeat monitoring. If a provider disconnects during a job, the job is automatically re-queued and assigned to another available provider. When a provider reconnects, stale registrations are cleaned up so the network always shows accurate provider counts.
How does pipeline parallelism work?
When a model is too large for any single Mac, the coordinator's shard scheduler splits it across multiple providers. Each provider loads a slice of the model's transformer layers and listens on a TCP port. During inference, hidden states flow through the pipeline: Shard 0 processes the first layers and passes its output to Shard 1, which processes the next set and passes to Shard 2, and so on. The final shard produces the output logits. This happens automatically — users just request a model and get a response.
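In miniature, the hidden-state relay reads like a fold over shard functions. This is a toy sketch of the dataflow only; real shards hold MLX transformer layers and exchange tensors over TCP.

```python
def run_pipeline(hidden_state, shards):
    # Each shard applies its slice of transformer layers and forwards the
    # hidden state to the next shard; the last shard yields the final output.
    for shard in shards:
        hidden_state = shard(hidden_state)
    return hidden_state

# Toy shards: each "layer slice" just adds an offset so the flow is visible
shards = [lambda h, i=i: h + i for i in range(3)]  # offsets 0, 1, 2
run_pipeline(10, shards)  # 10 -> 10 -> 11 -> 13
```

The coordinator's shard scheduler is what assigns each provider its slice and wires the shards together; from the user's side the pipeline behaves like one model.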
Why does my first request take longer?
If the requested model isn't already loaded in a provider's memory, it needs to be loaded from disk (or downloaded first). This can take 10–60 seconds depending on model size. Subsequent requests to the same model are much faster since the model stays in memory. The coordinator routes to providers that already have the model loaded whenever possible.
Can I limit how much of my Mac's resources are used?
Yes. The menubar app includes a RAM Allocation slider that lets you cap how much memory RAM Aggregator can use. Models that exceed your cap are automatically disabled. You can also enable Prevent Sleep to keep your Mac serving jobs while you're away, or disable it to let your Mac sleep normally when idle.
Does the app update automatically?
Yes, if auto-update is enabled (the default). The coordinator can push updates to all connected providers. When an update arrives, your daemon downloads the latest version, applies it, and restarts. You can disable auto-update in your config file if you prefer manual control.