RAM Aggregator

The decentralized AI inference network powered by Apple Silicon.

RAM Aggregator is a peer-to-peer network that connects people who want to use AI (users) with people who have spare compute (providers). Users pay with xRAM tokens to run LLM inference; providers earn xRAM by sharing their Mac's idle processing power.

The entire system runs on three pillars:

  1. Coordinator — The central routing layer that matches jobs to providers, manages the token ledger, and exposes an OpenAI-compatible API. Includes a shard scheduler for automatically splitting large models across multiple Macs.
  2. Provider Nodes — macOS menubar apps that connect to the coordinator via WebSocket, download models from HuggingFace, and run inference locally using Apple's MLX framework. Providers can serve full models or individual shards as part of a distributed pipeline.
  3. xRAM Token — An ERC-20 token on Base (Coinbase's L2) that powers all payments and rewards. 100M fixed supply with a Bitcoin-style halving emission schedule.

Highlights

Current Status: RAM Aggregator is live on Base mainnet. The smart contract is deployed, the coordinator is running on Fly.io, and the provider app is available for macOS. Pipeline parallelism, smart routing, and native chat templates are all live.

Quickstart: Users

Start chatting with AI in under a minute.

Option 1: Free Demo

Visit the chat app and start typing. The app uses a shared demo key (xram_free_test) that gives you a handful of free messages per day to try the network.

Option 2: Connect MetaMask (Unlimited)

  1. Open the chat app and click Connect Wallet.
  2. Switch MetaMask to the Base network (the app will prompt you).
  3. Click Deposit xRAM and choose an amount. Your xRAM tokens are transferred to the escrow smart contract.
  4. A session token is issued automatically. You can now chat with any model, paying per token from your deposit.
  5. When you're done, click Withdraw to reclaim your unused balance back to your wallet.

Option 3: OpenAI SDK (for Developers)

Point any OpenAI-compatible client at the RAM Aggregator API:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://ramaggregator.com/v1",
    api_key="your_session_token_here"  # sess_... from deposit
)

response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
```

Quickstart: Providers

Turn your Mac into an AI inference node and start earning xRAM tokens.

Requirements

Installation

  1. Download the RAM Aggregator macOS app (DMG installer).
  2. Drag it to Applications and launch. A chip icon appears in your menu bar.
  3. The app creates a Python virtual environment and installs dependencies automatically.

Configuration

  1. Click the menubar icon and select Rename Node to set a unique name for your provider.
  2. Open the Models submenu and click Enable on any models you want to serve.
  3. Models that exceed your system RAM are automatically greyed out and non-selectable.
  4. The model will download from HuggingFace (one-time) and then show as [ready].
  5. Use the RAM Allocation slider to control how much of your system memory is available for models (e.g., cap at 16GB if you want to keep resources free for other work).
  6. Enable Prevent Sleep if you want your Mac to stay awake and serve jobs while unattended.

Running

Click Start Daemon from the menubar. Your node connects to the coordinator, reports its available models, and begins accepting inference jobs. The status indicator turns green when you're online.

You can enable multiple models simultaneously (RAM permitting). The coordinator routes jobs to you based on which models you have loaded, your available memory, and your reputation score.

If the coordinator requests it, your node can also serve as a shard worker for pipeline parallelism — loading a portion of a large model and processing hidden states as part of a distributed pipeline. This happens automatically; no extra configuration needed.

Earning

Every time you complete an inference job, xRAM tokens are minted on-chain to your wallet address. The amount depends on the current emission epoch, the number of tokens generated, and your staking multiplier. You can check your earnings from the menubar's dashboard link.

Connection Resilience

The provider daemon automatically reconnects if the WebSocket connection drops. On reconnection, stale registrations are cleaned up so the coordinator always shows an accurate count of active providers. Inference runs in a dedicated thread so that long generation jobs (40+ seconds for large models) don't block WebSocket keepalive pings.
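As a rough sketch, the reconnect delay schedule might follow an exponential backoff with jitter; the daemon's actual parameters are not documented here, so the numbers below are illustrative:

```python
import random

def backoff_delays(base=1.0, cap=60.0, jitter=0.5):
    """Yield reconnect delays: exponential growth, capped, with random jitter."""
    delay = base
    while True:
        yield delay + random.uniform(0, jitter)
        delay = min(delay * 2, cap)
```

Each failed connection attempt would consume the next delay, and a successful registration would reset the generator, keeping retry pressure on the coordinator bounded.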

Uninstalling Models

To remove a model: open the Models submenu, expand the model, and click Uninstall (delete files). This removes the model from disk, HuggingFace cache, and memory.

System Architecture

RAM Aggregator uses a hub-and-spoke architecture with a central coordinator and distributed provider nodes. It supports two inference modes: single-provider (for models that fit in one Mac's memory) and pipeline-parallel (for models that need to be split across multiple Macs).


  [Users / AI Agents]
        |
        | HTTPS (OpenAI-compatible API)
        v
  [Coordinator + Shard Scheduler]  <--->  [Base L2 Smart Contract]
        |                                        (xRAM ERC-20)
        | WebSocket
        |
   ┌────┴─────────────────────────────────┐
   |                                       |
   |  Single Provider         Pipeline     |
   |  (fits in 1 Mac)        Parallelism   |
   |                         (split model) |
   v                                       v
  [Provider 1]              [Shard 0] ──TCP──> [Shard 1] ──TCP──> [Shard 2]
   M3 Max, 64GB              Mac #1              Mac #2             Mac #3
   Qwen3 8B (full)           Layers 0-10         Layers 11-21      Layers 22-32
  

Coordinator

The coordinator is a FastAPI server deployed on Fly.io. It handles job dispatch and routing, the xRAM token ledger, the OpenAI-compatible API gateway, and automatic shard scheduling.

Provider Nodes

Each provider runs a macOS menubar app that maintains a persistent WebSocket connection to the coordinator. Providers report their hardware specs, loaded models, and availability via heartbeats every 10 seconds.

When a job arrives, the provider runs inference using Apple's MLX framework. Inference executes in a dedicated thread (asyncio.to_thread) so the asyncio event loop stays responsive for WebSocket pings and heartbeats, even during long generation runs. Prompts are formatted using each model's native chat template (e.g., <|im_start|> for Qwen3, <|begin_of_text|> for Llama) to ensure clean, accurate output.
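The thread off-loading pattern looks roughly like this, with `run_inference` standing in for the blocking MLX generate call:

```python
import asyncio
import time

def run_inference(prompt: str) -> str:
    """Stand-in for a blocking MLX generation call (illustrative)."""
    time.sleep(0.1)  # simulates a long generation run
    return f"echo: {prompt}"

async def handle_job(prompt: str) -> str:
    # Off-load the blocking call to a worker thread so the asyncio
    # event loop stays free to answer WebSocket pings and heartbeats.
    return await asyncio.to_thread(run_inference, prompt)

async def main():
    # The sleep below runs concurrently with inference, demonstrating
    # that the loop is not blocked while the model generates.
    result, _ = await asyncio.gather(handle_job("Hello"), asyncio.sleep(0.05))
    return result
```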

Providers can also act as shard workers for pipeline parallelism. When the coordinator assigns a shard, the provider loads only the specified transformer layers and listens on a TCP port for hidden state data from the previous shard in the pipeline.

On-Chain Layer

The xRAM ERC-20 token is deployed on Base mainnet. The coordinator holds a signer key that can mint tokens from the treasury allocation. User deposits are handled via an escrow wallet with approve/transferFrom for withdrawals.

Job Flow

Here's what happens when a user sends a chat message:

  1. Request — User sends a chat message via the web UI or OpenAI API. The gateway validates their API key or session token.
  2. Model Resolution — The gateway resolves the model alias (e.g., qwen3-8b) to the full HuggingFace ID and formats the prompt using the model's native chat template.
  3. Routing Decision — The coordinator checks if any single provider can serve the model. If not, it checks for an existing pipeline shard group or triggers automatic shard scheduling.
  4. Dispatch (Single Provider) — For models that fit in one Mac, the dispatcher selects the best provider: it prefers those with the model already loaded, then falls back to any online provider with enough RAM.
  5. Dispatch (Pipeline) — For models split across multiple Macs, the coordinator routes the request to the pipeline orchestrator, which sends hidden states through each shard worker in sequence.
  6. Inference — The provider runs inference through MLX in a dedicated thread (keeping the WebSocket alive), applies the model's native chat template, strips any internal reasoning tags, and returns the result.
  7. Validation — The coordinator validates the result (output quality, timing plausibility, self-dealing, duplicate detection) before paying the reward.
  8. Payment — If validation passes, xRAM tokens are minted on-chain to the provider's wallet. The user's session deposit is deducted based on dynamic pricing.

Distributed Inference

Some models are too large for any single Mac. A 405B-parameter model needs roughly 250 GB of RAM — more than most machines have. RAM Aggregator solves this with pipeline parallelism: the model's transformer layers are split into shards, each shard runs on a different Mac, and hidden states flow through the pipeline via TCP.

How Pipeline Parallelism Works

A transformer model is a stack of identical layers. If a model has 60 layers and three Macs are available, the coordinator splits it into three shards:


  Mac #1 (Shard 0):  Layers 0–19   + Embedding layer
  Mac #2 (Shard 1):  Layers 20–39
  Mac #3 (Shard 2):  Layers 40–59  + Output head (logits)

  Flow for each token:
    [Prompt] → Embed → Shard 0 → TCP → Shard 1 → TCP → Shard 2 → Logits → [Token]
  

Each shard worker loads only its assigned layers into memory. The orchestrator (running on the coordinator or the first shard) manages the autoregressive generation loop: it tokenizes the input, computes embeddings, sends hidden states through the pipeline, collects logits from the final shard, samples the next token, and repeats.
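The layer split described above can be sketched as a small helper that divides a layer count into contiguous, near-equal shards:

```python
def split_layers(num_layers: int, num_shards: int) -> list[range]:
    """Split transformer layers into contiguous, near-equal shards."""
    base, extra = divmod(num_layers, num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # earlier shards absorb the remainder
        shards.append(range(start, start + size))
        start += size
    return shards
```

For 60 layers over three Macs this yields layers 0–19, 20–39, and 40–59, matching the diagram above; the real scheduler would additionally account for the embedding layer and output head on the first and last shards.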

Automatic Shard Scheduling

When a user requests a model that no single provider can serve, the coordinator's shard scheduler takes over:

  1. It queries the model's layer count and estimates RAM per shard.
  2. It finds online providers with enough free memory to hold at least one shard.
  3. It assigns shards to providers and sends LOAD_SHARD commands via WebSocket.
  4. Each provider downloads the model (if needed), loads its assigned layers, and starts a TCP shard worker.
  5. When all shards report SHARD_READY, the pipeline is marked as complete and ready for inference.

This entire process is transparent to the user. They request llama-3.1-405b and get a response — they don't need to know it was split across four Macs.

Shard Worker Architecture

Each shard worker is a lightweight TCP server that handles three message types:

Hidden states are serialized as compact binary arrays (MLX arrays → numpy → bytes) with shape/dtype metadata. The protocol uses length-prefixed JSON headers for routing and raw binary payloads for tensor data.
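A minimal sketch of this framing, illustrative rather than the exact wire format:

```python
import json
import struct

def pack_message(header: dict, payload: bytes) -> bytes:
    """Frame: 4-byte big-endian header length, JSON header, raw tensor bytes."""
    h = json.dumps(header).encode()
    return struct.pack(">I", len(h)) + h + payload

def unpack_message(buf: bytes) -> tuple[dict, bytes]:
    """Inverse of pack_message: read the length prefix, then header, then payload."""
    (hlen,) = struct.unpack_from(">I", buf, 0)
    header = json.loads(buf[4 : 4 + hlen])
    return header, buf[4 + hlen :]
```

The header would carry routing metadata such as shape and dtype (e.g. `{"shape": [1, 11, 4096], "dtype": "float16"}`), while the payload stays as raw bytes so tensors are never inflated into JSON.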

Performance: Pipeline parallelism adds latency per token (each shard hop is a TCP round-trip) but makes previously impossible models accessible. A Llama 3.1 405B that would need 250 GB of RAM can run across three M4 Pro Macs with 96 GB each. Throughput scales with the number of shards and the network bandwidth between them.

Network Requirements: For best pipeline performance, shard workers should be on the same local network (LAN). The coordinator assigns shards to providers that can reach each other; in practice, this means running multiple Macs in the same home or office. WAN pipeline parallelism is possible but adds significant per-token latency.

Smart Routing

The coordinator makes intelligent decisions about where to send each inference request. Here's the priority order:

Provider Selection

  1. Model already loaded — The coordinator prefers providers that have the requested model loaded and ready in memory. This avoids model loading time (which can be 10–60 seconds for larger models).
  2. Preferred provider — If the client specifies a preferred_provider ID (e.g., for testing or affinity), the coordinator routes to that provider if it's online.
  3. Best available — Among providers with the right model, the coordinator picks the one with the highest available_memory × reputation score.
  4. Lazy-load fallback — If no provider has the model loaded, any online provider with enough RAM is selected. The model will be downloaded and loaded on first use.
  5. Pipeline fallback — If no single provider has enough RAM, the shard scheduler splits the model across multiple providers.
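The priority order above can be sketched as a selection function; the field names are illustrative, not the coordinator's actual schema:

```python
def pick_provider(providers, model, required_gb, preferred=None):
    """Pick a provider per the routing priority order (illustrative sketch)."""
    online = [p for p in providers if p["online"]]
    # 2. Preferred provider, if specified and online.
    if preferred is not None:
        for p in online:
            if p["id"] == preferred:
                return p
    # 1./3. Prefer providers with the model already loaded in memory.
    loaded = [p for p in online if model in p["loaded_models"]]
    # 4. Lazy-load fallback: any online provider with enough free RAM.
    pool = loaded or [p for p in online if p["free_ram_gb"] >= required_gb]
    if not pool:
        return None  # 5. fall through to the shard scheduler
    # Score candidates by available memory x reputation.
    return max(pool, key=lambda p: p["free_ram_gb"] * p["reputation"])
```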

Chat Template Formatting

Different model families require different prompt formats. The coordinator sends prompts in a generic System: ... / User: ... / Assistant: format, and the provider's inference engine applies the correct chat template using the tokenizer's built-in apply_chat_template function. This ensures each model sees its native token format:

| Model Family | Template Style | Special Handling |
|---|---|---|
| Llama 3.x | <\|begin_of_text\|> + role headers | Standard instruct format |
| Qwen 2.5 / 3 | <\|im_start\|> ChatML-style | Qwen3: /no_think directive to suppress chain-of-thought |
| Mistral | [INST] markers | Standard instruct format |
| DeepSeek | ChatML-style | <think> tags stripped from output |

If a model includes internal reasoning in <think>...</think> tags, the provider automatically strips these before returning the response, so users see only the final answer.
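The tag stripping can be done with a small regex pass, for example:

```python
import re

# Matches an internal reasoning block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove internal <think>...</think> reasoning before returning output."""
    return THINK_RE.sub("", text).strip()
```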

Model Catalog

All models are 4-bit quantized MLX versions from the mlx-community HuggingFace organization.

Edge & Small (8–16 GB RAM)

| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Llama 3.2 1B Instruct | 1B | ~1.5 GB | ~150 tok/s | Simple tasks, fast replies |
| Llama 3.2 3B Instruct | 3B | ~3 GB | ~100 tok/s | Balanced speed & quality |
| Mistral 7B Instruct v0.3 | 7B | ~5.5 GB | ~60 tok/s | Reasoning, instruction following |
| Qwen 2.5 7B Instruct | 7B | ~5.5 GB | ~60 tok/s | Multilingual, coding |
| Qwen3 8B | 8B | ~5.5 GB | ~55 tok/s | Latest gen, built-in thinking mode |
| Llama 3.1 8B Instruct | 8B | ~6 GB | ~50 tok/s | General purpose |

Medium (16–64 GB RAM)

| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen3 30B MoE (NEW) | 30B (3B active) | ~18 GB | ~90 tok/s | Smart like 30B, fast like 3B |
| Qwen3 Coder 30B MoE (NEW) | 30B (3B active) | ~18 GB | ~90 tok/s | Sonnet-class agentic coding |
| Qwen 2.5 Coder 32B Instruct | 32B | ~20 GB | ~25 tok/s | Code generation, 80+ languages |
| DeepSeek R1 Distill 32B | 32B | ~20 GB | ~25 tok/s | Chain-of-thought reasoning |
| Llama 3.3 70B Instruct | 70B | ~42 GB | ~12 tok/s | Latest Llama, best quality/size |
| Llama 3.1 70B Instruct | 70B | ~42 GB | ~10 tok/s | Proven workhorse |
| Qwen 2.5 72B Instruct | 72B | ~45 GB | ~10 tok/s | GPT-4 class multilingual |

Large (96–192 GB RAM)

| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Mistral Large 2 | 123B | ~75 GB | ~6 tok/s | Mistral flagship, complex tasks |
| Qwen3 235B MoE (NEW) | 235B (22B active) | ~135 GB | ~30 tok/s | Thinking mode, 119 languages |

XL (192–384 GB RAM)

| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen3.5 397B MoE (NEW) | 397B (17B active) | ~225 GB | ~35 tok/s | Latest Qwen flagship, hybrid DeltaNet, 201 languages |
| Llama 3.1 405B Instruct | 405B | ~250 GB | ~3 tok/s | Largest open dense model |
| Qwen3 Coder 480B MoE (NEW) | 480B (35B active) | ~280 GB | ~18 tok/s | Ultimate coding, frontier-class |

Ultra (400+ GB RAM)

| Model | Params | RAM | Speed | Best For |
|---|---|---|---|---|
| DeepSeek R1 671B | 671B MoE (37B active) | ~400 GB | ~5 tok/s | Ultimate reasoning |
| DeepSeek V3 671B | 671B MoE | ~400 GB | ~5 tok/s | Flagship general-purpose MoE |
Tip: Speeds are approximate on M3 Max / M4 Pro. Newer chips like M4 Max and M5 series are significantly faster. MoE (Mixture of Experts) models only activate a fraction of their parameters per token, so Qwen3.5 397B runs at ~35 tok/s despite having 397B total params (only 17B active). The menubar app automatically blocks models that exceed 85% of your system RAM.

512 GB Macs: Mac Studio and Mac Pro with M3/M4 Ultra can run models up to 671B parameters entirely in unified memory, including DeepSeek R1, DeepSeek V3, Qwen3 Coder 480B, and Qwen3.5 397B. No consumer GPU setup can match this. This is where RAM Aggregator truly shines.

Provider Menubar App

The RAM Aggregator provider app lives in your macOS menu bar. It's a native Swift app that manages a Python daemon under the hood. Everything is controlled from the menu bar icon — no terminal required.

What You See

The menu bar icon (a chip symbol) shows your connection status at a glance. Click it to access controls including Start Daemon, Rename Node, the Models submenu, the RAM Allocation slider, Prevent Sleep, and the earnings dashboard link.

Under the Hood

When you click Start Daemon, the app launches a Python process that:

  1. Loads all enabled models into memory using MLX.
  2. Connects to the coordinator via a persistent WebSocket.
  3. Registers its hardware specs, loaded models, and encryption public key.
  4. Sends heartbeats every 10 seconds to stay in the provider registry.
  5. Accepts inference jobs, runs them through MLX, and returns results.

All configuration is stored in ~/.ram-aggregator/config.json. Model weights are cached in the standard HuggingFace cache directory.

Provider Controls & Settings

RAM Allocation

The RAM Allocation slider lets you control exactly how much memory RAM Aggregator can use for models. This is useful if you want to keep some RAM free for other applications while still contributing to the network.

When you lower the RAM limit, models that exceed the new cap are automatically greyed out in the model selector. The daemon reports the capped memory to the coordinator, which takes it into account when routing jobs. For example, if your Mac has 64 GB but you set the limit to 32 GB, you can still serve 7B–32B models comfortably without impacting your other work.
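The gating logic can be sketched as follows, combining the slider cap with the 85% headroom rule mentioned in the model catalog (function and field names are illustrative):

```python
def selectable_models(catalog_gb, system_ram_gb, ram_limit_gb=None):
    """Map each model to whether it fits under the effective RAM cap.

    catalog_gb: {model_name: required_ram_gb}. The effective cap is 85% of
    system RAM, further lowered by the user's RAM Allocation slider if set.
    """
    cap = 0.85 * system_ram_gb
    if ram_limit_gb is not None:
        cap = min(cap, ram_limit_gb)
    return {name: need <= cap for name, need in catalog_gb.items()}
```

With a 64 GB Mac capped at 32 GB, a 3B model (~3 GB) stays selectable while a 70B model (~42 GB) is greyed out, matching the example above.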

Prevent Sleep

macOS puts your Mac to sleep after a period of inactivity, which disconnects the provider from the network. The Prevent Sleep toggle keeps your Mac awake so it can serve inference jobs around the clock. This is ideal for dedicated provider setups (e.g., a Mac Mini or Mac Studio running headless).

When enabled, the app uses macOS power assertions to prevent system sleep. Display sleep still occurs normally — only system sleep is prevented. Disable the toggle to restore your normal sleep settings.

E2E Encryption Key

On first launch, the daemon generates an X25519 key pair for end-to-end encryption. The private key is stored in ~/.ram-aggregator/encryption_key.bin with restrictive file permissions (0600). The public key is included in every registration and heartbeat message, allowing clients to encrypt prompts specifically for your provider.

Auto-Update System

RAM Aggregator includes an over-the-air update mechanism that keeps providers running the latest version without manual intervention.

How It Works

  1. When a new version is available, the coordinator sends a FORCE_UPDATE message to all connected providers via WebSocket.
  2. The daemon writes a flag file with the target version and executes the update script.
  3. The update script downloads the latest daemon code, updates dependencies, and restarts the process.
  4. The provider reconnects to the coordinator with the new version.

Auto-update can be disabled in ~/.ram-aggregator/config.json by setting "auto_update": false. When disabled, the provider will log a warning about the available update but won't apply it automatically.

Version Management: The coordinator tracks each provider's reported version. Providers running outdated versions receive update prompts. Critical security patches may require providers to update before they can continue serving jobs.

xRAM Token Overview

| | |
|---|---|
| Token Name | RAM Aggregator (xRAM) |
| Standard | ERC-20 |
| Chain | Base Mainnet |
| Contract | 0x3BeB23287f24Db91249D8D90aD61a0e07F4F4C5c |
| Total Supply | 100,000,000 xRAM |
| Decimals | 18 |
| Trade | Aerodrome (ETH/xRAM) |

Allocation

Emission Schedule

xRAM uses a Bitcoin-inspired halving model. The 70M provider treasury is divided into 4 epochs of 17.5M tokens each. Each epoch halves the base reward rate.

| Epoch | Tokens Available | Reward Multiplier | Daily Cap |
|---|---|---|---|
| 1 (0 – 17.5M minted) | 17,500,000 | 1.0x | 5,000,000/day |
| 2 (17.5M – 35M) | 17,500,000 | 0.5x | 2,500,000/day |
| 3 (35M – 52.5M) | 17,500,000 | 0.25x | 1,250,000/day |
| 4 (52.5M – 70M) | 17,500,000 | 0.125x | 625,000/day |

Additionally, hourly emission caps prevent flash-draining. The reward per job is calculated as: base_reward * epoch_multiplier * staking_bonus, capped by both daily and hourly limits.
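Ignoring the daily and hourly caps, the reward formula can be sketched directly from the epoch table:

```python
EPOCHS = [  # (minted ceiling, multiplier), from the emission schedule above
    (17_500_000, 1.0),
    (35_000_000, 0.5),
    (52_500_000, 0.25),
    (70_000_000, 0.125),
]

def job_reward(base_reward, total_minted, staking_bonus=1.0):
    """base_reward * epoch_multiplier * staking_bonus (caps not modeled here)."""
    for ceiling, mult in EPOCHS:
        if total_minted < ceiling:
            return base_reward * mult * staking_bonus
    return 0.0  # provider treasury exhausted
```

The same job pays 1.0x in Epoch 1 and 0.125x in Epoch 4, the 8x gap noted below.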

Early Mover Advantage: Providers who join during Epoch 1 earn 8x more per job than those who join in Epoch 4. The halving schedule creates a strong incentive to participate early.

Dynamic Pricing

The cost of inference adjusts automatically based on network utilization, similar to Ethereum gas fees.

| Factor | Effect |
|---|---|
| Network utilization | Higher utilization → higher prices |
| Model size | 70B models cost ~10x more than 1B models |
| Tokens generated | Cost scales linearly with output length |

The pricing engine targets 60% network utilization. Prices have a floor of 0.1 xRAM/1K tokens and a ceiling of 50 xRAM/1K tokens. Current pricing can be checked via GET /api/v1/pricing.
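A toy version of the clamping behavior; the actual pricing curve is not documented here, so the linear scaling around the target is an assumption:

```python
FLOOR, CEILING, TARGET = 0.1, 50.0, 0.60  # xRAM per 1K tokens; 60% utilization target

def price_per_1k(base_price, utilization):
    """Scale price with utilization around the target, then clamp to the band."""
    multiplier = utilization / TARGET  # illustrative curve, not the real engine
    return min(max(base_price * multiplier, FLOOR), CEILING)
```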

Staking

Providers can stake xRAM tokens to increase their reward multiplier and get priority in job routing.

API Authentication

All API requests use the Authorization header with a Bearer token.

Key Types

| Type | Format | Use Case |
|---|---|---|
| Demo | xram_free_test | Free testing (rate limited, shared) |
| Session | sess_... | MetaMask deposit sessions (auto-issued) |
| Live | xram_live_... | Production keys (admin-created) |
| Agent | xram_agent_... | AI agent keys (admin-created) |

Session tokens are issued automatically when you deposit xRAM through the chat app. For programmatic access, use your session token as the API key.

Chat Completions

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint. Works with any OpenAI SDK.

Request Body

```json
{
  "model": "llama-3.2-3b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```

Response

```json
{
  "id": "chatcmpl-a1b2c3...",
  "object": "chat.completion",
  "model": "llama-3.2-3b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 24, "completion_tokens": 8, "total_tokens": 32},
  "xram_tokens_charged": 0.032,
  "xram_session_remaining": 9967.5
}
```

Model Aliases

Use the short alias or the full HuggingFace model ID in API requests. Both work.

| Alias | Full Model ID |
|---|---|
| llama-3.2-1b | mlx-community/Llama-3.2-1B-Instruct-4bit |
| llama-3.2-3b | mlx-community/Llama-3.2-3B-Instruct-4bit |
| llama-3.1-8b | mlx-community/Meta-Llama-3.1-8B-Instruct-4bit |
| mistral-7b | mlx-community/Mistral-7B-Instruct-v0.3-4bit |
| qwen-2.5-7b | mlx-community/Qwen2.5-7B-Instruct-4bit |
| qwen3-8b | mlx-community/Qwen3-8B-4bit |
| qwen3-30b | mlx-community/Qwen3-30B-A3B-4bit |
| qwen3-coder-30b | mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit |
| qwen-2.5-coder-32b | mlx-community/Qwen2.5-Coder-32B-Instruct-4bit |
| deepseek-r1-distill-32b | mlx-community/DeepSeek-R1-Distill-Qwen-32B-4bit |
| llama-3.3-70b | mlx-community/Llama-3.3-70B-Instruct-4bit |
| llama-3.1-70b | mlx-community/Meta-Llama-3.1-70B-Instruct-4bit |
| qwen-2.5-72b | mlx-community/Qwen2.5-72B-Instruct-4bit |
| mistral-large-2 | mlx-community/Mistral-Large-Instruct-2407-4bit |
| qwen3-235b | mlx-community/Qwen3-235B-A22B-4bit |
| qwen3.5-397b | mlx-community/Qwen3.5-397B-A17B-nvfp4 |
| llama-3.1-405b | mlx-community/Meta-Llama-3.1-405B-Instruct-4bit |
| qwen3-coder-480b | mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit |
| deepseek-r1 | mlx-community/DeepSeek-R1-4bit |
| deepseek-v3 | mlx-community/DeepSeek-V3-4bit |

Models

GET /v1/models

List available models on the network. Returns models that at least one online provider has loaded, or the full catalog if no providers are online.

GET /api/v1/models/marketplace

Enriched model listing with provider counts, real-time pricing, and performance metrics for each model.

Deposits & Sessions

POST /api/v1/deposit/verify

After transferring xRAM to the escrow address via MetaMask, call this endpoint with the transaction hash to verify the deposit on-chain and receive a session token.

```json
{
  "tx_hash": "0xabc123...",
  "wallet_address": "0xYourWalletAddress..."
}
```

GET /api/v1/deposit/balance

Check your remaining deposit balance. Requires session token in Authorization header.

POST /api/v1/deposit/withdraw

Close your session and withdraw remaining balance. The escrow contract is updated to allow you to call transferFrom to reclaim your tokens on-chain.

GET /api/v1/config/escrow

Get the escrow wallet address and contract details needed for MetaMask deposits.

E2E Encryption Endpoints

These endpoints implement the two-phase encrypted inference protocol. Use them when you want end-to-end encryption so the coordinator cannot see your prompts or responses.

GET /v1/e2e/keys

List online providers with their X25519 public encryption keys. Public endpoint — no authentication required. Use this to inspect which providers support E2E before initiating a request.

POST /v1/e2e/init

Phase 1: Send job metadata (model, max_tokens, temperature) without any prompt data. The coordinator assigns a provider and returns their public key plus a job ID.

```
// Request
{ "model": "qwen3-8b", "max_tokens": 256, "temperature": 0.7 }

// Response
{
  "job_id": "abc123...",
  "provider_id": "p_xyz",
  "provider_name": "Steve's MacBook",
  "provider_public_key_b64": "base64-encoded X25519 public key"
}
```

POST /v1/e2e/prompt

Submit the encrypted prompt for a previously initialized job. The coordinator forwards the encrypted blob to the assigned provider without decryption.

```
// Request
{
  "job_id": "abc123...",
  "encrypted_prompt_b64": "base64-encoded AES-256-GCM ciphertext",
  "prompt_nonce_b64": "base64-encoded 96-bit nonce",
  "client_public_key_b64": "base64-encoded ephemeral X25519 public key"
}

// Response (encrypted)
{
  "encrypted": true,
  "encrypted_output_b64": "base64-encoded ciphertext",
  "output_nonce_b64": "base64-encoded nonce",
  "xram_job_id": "abc123...",
  "xram_tokens_charged": 0.15
}
```

Providers

GET /api/v1/providers

List online providers with hardware specs, loaded models, and reputation scores. Financial details (earnings, wallet addresses) require authentication.

GET /api/v1/network

Network-wide statistics: total providers, online count, available memory, and tokens minted.

GET /api/v1/pricing

Current dynamic pricing state: utilization, price multiplier, and base rates.

GET /api/v1/pricing/quote

Get a price quote for a specific model and token count. Parameters: model (e.g., "7B"), max_tokens (e.g., 256).

Privacy Overview

RAM Aggregator is a decentralized network. Understanding who can see what is essential to using it with confidence. This section explains exactly what data is collected, who can access it, and how our encryption features protect you.

Our Privacy Commitment: We believe in radical transparency. Rather than burying details in legal boilerplate, this page tells you plainly and precisely what data is visible, to whom, and what protections exist. If something isn't encrypted, we'll say so.

The Three Parties

Every inference request involves three parties. Each has different levels of data access:

  1. You (the User) — You control your prompts, your wallet, and whether to enable end-to-end encryption.
  2. The Coordinator — The central routing server. It matches your request to a provider and manages billing. Think of it like a postal service: it needs to know the destination, but doesn't need to read the letter.
  3. The Provider — The Mac running your inference. The provider must see your prompt to generate a response — this is fundamental to how AI inference works. You cannot ask a model to answer a question without the model seeing the question.

Transport Security

All connections use TLS 1.3 (HTTPS / WSS). This protects against eavesdroppers on the network (ISPs, Wi-Fi snoopers, etc.). TLS encrypts data in transit between you and the coordinator, and between the coordinator and providers. However, TLS alone does not prevent the coordinator from reading data that passes through it — that's what E2E encryption addresses.

What We Don't Collect

End-to-End Encryption

RAM Aggregator supports optional end-to-end (E2E) encryption that prevents the coordinator from reading your prompts and responses. When enabled, only you and the assigned provider can see the content of your conversation.

How It Works

E2E encryption uses a two-phase protocol built on industry-standard cryptographic primitives:

  1. Key Exchange (Phase 1) — Your browser generates an ephemeral X25519 key pair. The coordinator assigns a provider and returns that provider's public key. Neither the coordinator nor anyone else learns your private key.
  2. Encrypted Inference — Your browser uses Elliptic Curve Diffie-Hellman (ECDH) to derive a shared secret with the provider. Your prompt is encrypted with AES-256-GCM before leaving your browser. The coordinator receives only an opaque encrypted blob and forwards it untouched to the provider. The provider decrypts the prompt, runs inference, encrypts the response with the same shared key, and sends it back. The coordinator never sees the plaintext.

Cryptographic Details

| | |
|---|---|
| Key Agreement | X25519 ECDH (Curve25519) |
| Key Derivation | HKDF-SHA256 with info string xram-e2e-v1 |
| Symmetric Encryption | AES-256-GCM (authenticated encryption) |
| Client Implementation | Web Crypto API (SubtleCrypto), zero external dependencies |
| Provider Implementation | Python cryptography library (OpenSSL-backed) |
| Key Lifetime | Ephemeral; new key pair generated per request |
| Nonces | Random 96-bit, unique per encryption operation |
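The handshake can be reproduced with the Python cryptography library the provider side uses. This is a sketch of the primitives (X25519 ECDH, HKDF-SHA256 with the xram-e2e-v1 info string, AES-256-GCM), not the production protocol; here one process plays both roles:

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_key(own_private, peer_public):
    """X25519 ECDH, then HKDF-SHA256 with the xram-e2e-v1 info string."""
    shared = own_private.exchange(peer_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"xram-e2e-v1").derive(shared)

# Ephemeral client key pair; in practice the provider's public key
# arrives via POST /v1/e2e/init.
client = X25519PrivateKey.generate()
provider = X25519PrivateKey.generate()  # stands in for the provider's key pair

key = derive_key(client, provider.public_key())
nonce = os.urandom(12)  # random 96-bit nonce, unique per operation
ciphertext = AESGCM(key).encrypt(nonce, b"What is the capital of France?", None)

# The provider derives the same key from its private key + the client public key,
# so the coordinator only ever relays the opaque ciphertext.
provider_key = derive_key(provider, client.public_key())
plaintext = AESGCM(provider_key).decrypt(nonce, ciphertext, None)
```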

How to Enable E2E Encryption

In the chat app, click the lock icon next to the temperature selector. When the toggle is green and shows "E2E Encrypted", all subsequent messages will be encrypted. You'll see a green "E2E" badge on each encrypted message.

Performance Impact: E2E encryption adds less than 0.5 milliseconds of overhead per request. The cryptographic operations (key generation, ECDH, AES-GCM) are extremely fast on modern hardware. The only measurable addition is one extra HTTP round-trip for the key exchange phase, which typically adds 10-100 ms depending on your network latency. For inference requests that take 2-30+ seconds, this is effectively invisible.

What E2E Encryption Protects Against

What E2E Encryption Does NOT Protect Against

We believe in being completely honest about limitations:

Important: E2E encryption protects your data from the coordinator (the infrastructure), not from the provider (the compute node). If you need complete privacy from all parties, that would require homomorphic encryption or secure enclaves, technologies that are not yet practical for LLM inference. We are tracking these developments and will adopt them when viable.

Data Visibility Matrix

This table shows exactly who can see what, depending on whether E2E encryption is enabled. We publish this so you can make informed decisions about when to use encryption.

With E2E Encryption OFF (Default)

| Data | You | Coordinator | Provider | Other Users |
|---|---|---|---|---|
| Your prompt text | Yes | Yes | Yes | No |
| AI response text | Yes | Yes | Yes | No |
| Model used | Yes | Yes | Yes | No |
| Token count | Yes | Yes | Yes | No |
| Your wallet address | Yes | Yes | No | No |
| Your IP address | Yes | Yes | No | No |
| Provider identity | Yes | Yes | Yes | No |

With E2E Encryption ON

| Data | You | Coordinator | Provider | Other Users |
|---|---|---|---|---|
| Your prompt text | Yes | No 🔒 | Yes | No |
| AI response text | Yes | No 🔒 | Yes | No |
| Model used | Yes | Yes | Yes | No |
| Token count (approx.) | Yes | Yes* | Yes | No |
| Your wallet address | Yes | Yes | No | No |
| Your IP address | Yes | Yes | No | No |
| Encrypted payload size | Yes | Yes | Yes | No |

* The coordinator can infer approximate token counts from encrypted payload sizes but cannot see the actual content.

Provider Trust Model

In any decentralized inference network, you are trusting the provider to honestly execute your inference. This is similar to how cloud computing works: when you use AWS, Azure, or any cloud API, the server running your code can see your data. The difference with RAM Aggregator is that:

Comparison to Centralized AI Services: When you use ChatGPT, Claude, or Gemini, the service provider sees all your prompts, stores conversation history, and may use your data for training. With RAM Aggregator + E2E encryption, neither the coordinator nor any centralized entity has access to your conversation content. The individual provider sees your prompt only for the duration of inference and has no persistent storage or logging of your data.

Anti-Gaming & Security

RAM Aggregator includes multiple layers of protection to prevent reward manipulation.

Validation Checks

Every inference result passes through six validation checks before rewards are paid:

  1. Response quality — Rejects empty or trivially short outputs.
  2. Token count plausibility — Detects inflated token_generated claims by comparing to actual output length.
  3. Timing validation — Flags responses that are faster than physically possible for the model size on Apple Silicon.
  4. Self-dealing detection — Blocks rewards when the provider and client wallets are the same.
  5. Duplicate output detection — Identifies repeated identical responses (copy-paste farming).
  6. Entropy check — Rejects very low-entropy outputs (repetitive junk like "the the the...").
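As a sketch of how a few of these checks might look in practice, here is a minimal validator covering response quality, token-count plausibility, and the entropy check. The function names, thresholds, and the assumption of roughly four characters per token are illustrative, not the coordinator's actual implementation.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the output; repetitive junk scores low."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def validate_result(output: str, claimed_tokens: int,
                    min_entropy: float = 3.0) -> bool:
    """Hypothetical versions of three of the six checks."""
    # 1. Response quality: reject empty or trivially short outputs.
    if len(output.strip()) < 2:
        return False
    # 2. Token count plausibility: English text averages ~4 characters
    #    per token, so a claim implying less than 1 char/token is inflated.
    if claimed_tokens > len(output):
        return False
    # 6. Entropy check: "the the the..." sits around 2 bits/char,
    #    while normal English prose is above 4.
    if shannon_entropy(output) < min_entropy:
        return False
    return True
```

The entropy threshold here is a stand-in; a production validator would calibrate it per model and language.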

Proof of Inference

The coordinator periodically sends challenge prompts to random online providers with known expected answers. Providers that return incorrect results receive reputation penalties, reducing their priority for future jobs.
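A minimal sketch of this challenge loop, assuming a hypothetical challenge bank and reputation weights (the real penalty values and answer-matching logic are not published):

```python
# Hypothetical challenge bank; in production, expected answers would
# come from a trusted reference run of the same model.
CHALLENGES = {
    "What is 2 + 2?": "4",
    "Spell 'cat' backwards.": "tac",
}

class Reputation:
    """Toy reputation score in [0, 1]; penalty/boost sizes are assumptions."""
    def __init__(self, score: float = 1.0):
        self.score = score

    def apply_challenge(self, prompt: str, provider_answer: str) -> None:
        expected = CHALLENGES[prompt]
        if expected.lower() in provider_answer.lower():
            self.score = min(1.0, self.score + 0.01)   # small boost for passing
        else:
            self.score = max(0.0, self.score - 0.25)   # heavy penalty for failing
```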

Validation Under Encryption

When E2E encryption is enabled, the coordinator cannot read prompt or response content. Some validation checks are adjusted accordingly:

| Check | Plaintext | Encrypted | Notes |
|-------|-----------|-----------|-------|
| Response quality | Active | Skipped | Can't read encrypted output |
| Token count plausibility | Active | Active | Metadata, not encrypted |
| Timing validation | Active | Active | Metadata, not encrypted |
| Self-dealing detection | Active | Active | Wallet comparison |
| Duplicate output detection | Active | Skipped | Can't hash encrypted blobs |
| Entropy check | Active | Skipped | Can't read encrypted output |
| Replay protection | Active | Active | Nonce-based, unchanged |

Content-based checks are unavailable under encryption, but timing, metadata, and identity checks remain fully active. Provider reputation scoring over time compensates for the reduced validation surface.
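The gating described above amounts to a simple rule: skip any check that needs plaintext when the payload is encrypted. A sketch, with illustrative check names:

```python
# Which checks need to read the prompt/response content.
# Mirrors the plaintext-vs-encrypted breakdown above; names are illustrative.
CHECKS = {
    "response_quality":   {"needs_plaintext": True},
    "token_plausibility": {"needs_plaintext": False},
    "timing":             {"needs_plaintext": False},
    "self_dealing":       {"needs_plaintext": False},
    "duplicate_output":   {"needs_plaintext": True},
    "entropy":            {"needs_plaintext": True},
    "replay_protection":  {"needs_plaintext": False},
}

def active_checks(encrypted: bool) -> list[str]:
    """Return the checks that can run given the encryption state."""
    return [name for name, c in CHECKS.items()
            if not (encrypted and c["needs_plaintext"])]
```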

Additional Protections

Smart Contract

The xRAM token is a standard ERC-20 contract deployed on Base mainnet with additional functions for the emission schedule and staking.

| Detail | Value |
|--------|-------|
| Contract Address | `0x3BeB23287f24Db91249D8D90aD61a0e07F4F4C5c` |
| Network | Base Mainnet (Chain ID: 8453) |
| View on BaseScan | Explorer Link |

Key Functions

Escrow Flow

  1. User calls `transfer(escrow_address, amount)` to deposit.
  2. Coordinator verifies the on-chain transaction and issues a session token.
  3. Escrow calls `approve(user_address, remaining)` to allow withdrawal.
  4. User calls `transferFrom(escrow, self, remaining)` to reclaim tokens.
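The four steps above use only standard ERC-20 semantics, so they can be traced with an in-memory stand-in for the token contract. This is a toy ledger to illustrate the flow, not the xRAM contract; amounts and account names are made up.

```python
class ERC20Sim:
    """Minimal in-memory ERC-20 ledger for walking through the escrow flow."""
    def __init__(self):
        self.balances = {}
        self.allowances = {}   # (owner, spender) -> approved amount

    def transfer(self, sender, to, amount):
        assert self.balances.get(sender, 0) >= amount, "insufficient balance"
        self.balances[sender] = self.balances.get(sender, 0) - amount
        self.balances[to] = self.balances.get(to, 0) + amount

    def approve(self, owner, spender, amount):
        self.allowances[(owner, spender)] = amount

    def transfer_from(self, spender, owner, to, amount):
        key = (owner, spender)
        assert self.allowances.get(key, 0) >= amount, "allowance too low"
        self.allowances[key] -= amount
        self.transfer(owner, to, amount)

token = ERC20Sim()
token.balances = {"user": 100, "escrow": 0}
token.transfer("user", "escrow", 100)              # 1. user deposits 100 xRAM
# 2. coordinator verifies the deposit and issues a session token (off-chain)
token.approve("escrow", "user", 40)                # 3. escrow approves the remainder
token.transfer_from("user", "escrow", "user", 40)  # 4. user reclaims unused tokens
```

After the run, the escrow keeps the 60 tokens spent on inference and the user holds the reclaimed 40.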

FAQ

What hardware do I need to be a provider?

Any Mac with Apple Silicon (M1 or later). The more RAM you have, the larger the models you can serve. An M1 with 8GB can run the 1B and 3B models; an M3 Max with 64GB can run everything up to and including 70B models. For truly massive models (405B, 671B), RAM Aggregator uses pipeline parallelism to split the model across multiple Macs — so even a few 32GB machines working together can serve a 405B model.

Is my data private?

RAM Aggregator offers end-to-end encryption that prevents the coordinator (our infrastructure) from reading your prompts or responses. When E2E is enabled, your data is encrypted in your browser before it ever leaves your device and can only be decrypted by the assigned provider. The provider must see your prompt to run inference — this is inherent to how AI models work. However, providers don't log or store your data, and they cannot identify you (your wallet address is not shared with them). See the Privacy & Encryption section for a full data visibility breakdown.

What does E2E encryption protect?

E2E encryption prevents the coordinator (our central server) from seeing your prompt and response content. It uses X25519 key exchange and AES-256-GCM encryption. The assigned provider can still see your data because it must run inference on it — but it cannot identify who you are, and it does not store your data. Think of it like end-to-end encryption in messaging apps: the server that routes messages can't read them, but the recipient (the provider running your model) can.
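The key-exchange-plus-AEAD pattern described here can be sketched with the third-party `cryptography` package. This is an illustration of the X25519 + AES-256-GCM shape, not RAM Aggregator's actual wire protocol; the HKDF parameters and `info` label are assumptions.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Client side: ephemeral X25519 keypair, shared secret with the provider's public key.
provider_key = X25519PrivateKey.generate()   # held by the provider
client_key = X25519PrivateKey.generate()     # generated in the browser
shared = client_key.exchange(provider_key.public_key())

# Derive a 256-bit AES key from the shared secret (HKDF parameters are assumptions).
aes_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"xram-e2e").derive(shared)

# Encrypt the prompt; the coordinator only ever relays the opaque ciphertext.
nonce = os.urandom(12)
ciphertext = AESGCM(aes_key).encrypt(nonce, b"my private prompt", None)

# Provider side: the same exchange yields the same key, so it can decrypt.
provider_shared = provider_key.exchange(client_key.public_key())
provider_aes = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                    info=b"xram-e2e").derive(provider_shared)
plaintext = AESGCM(provider_aes).decrypt(nonce, ciphertext, None)
```

Note how the coordinator's view matches the visibility matrix: it sees only the ciphertext and its length (the GCM tag adds 16 bytes), never the plaintext.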

Does the provider store my data?

No. Providers process your prompt in memory, generate a response, and discard both immediately. There is no logging, no persistent storage, and no data retention on provider nodes. Providers also cannot see your wallet address or correlate your requests across sessions when E2E is enabled.

How much can I earn as a provider?

Earnings depend on the current emission epoch, how many jobs you complete, and your staking level. Early providers in Epoch 1 earn the most. Check the emission schedule section for detailed rates.

Can I run this on Linux or Windows?

Currently, the provider app only supports macOS with Apple Silicon due to the MLX framework requirement. The user chat app works in any browser.

Is xRAM a real cryptocurrency?

xRAM is a real ERC-20 token on Base mainnet (Coinbase's L2). It has a fixed supply of 100M tokens with a Bitcoin-style halving emission schedule.

What happens if a provider goes offline mid-job?

The coordinator detects offline providers via heartbeat monitoring. If a provider disconnects during a job, the job is automatically re-queued and assigned to another available provider. When a provider reconnects, stale registrations are cleaned up so the network always shows accurate provider counts.

How does pipeline parallelism work?

When a model is too large for any single Mac, the coordinator's shard scheduler splits it across multiple providers. Each provider loads a slice of the model's transformer layers and listens on a TCP port. During inference, hidden states flow through the pipeline: Shard 0 processes the first layers and passes its output to Shard 1, which processes the next set and passes to Shard 2, and so on. The final shard produces the output logits. This happens automatically — users just request a model and get a response.
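The shard hand-off above can be sketched with a toy model whose "layers" are plain functions. Here each shard holds a contiguous slice of layers and the hidden state flows through them in order; in production each shard is a separate Mac reached over TCP, but the data flow is the same.

```python
def make_layer(i):
    """Stand-in for a transformer layer: transforms the hidden state."""
    return lambda h: [x + i for x in h]

layers = [make_layer(i) for i in range(8)]    # an 8-layer toy "model"

def split_into_shards(layers, n_shards):
    """Assign contiguous layer slices to shards, as the shard scheduler does."""
    size = -(-len(layers) // n_shards)        # ceiling division
    return [layers[i:i + size] for i in range(0, len(layers), size)]

def run_pipeline(shards, hidden):
    # Shard 0 runs first, hands its output to shard 1, and so on;
    # here the network hop is just a function call.
    for shard in shards:
        for layer in shard:
            hidden = layer(hidden)
    return hidden

shards = split_into_shards(layers, 3)         # e.g. three cooperating Macs
out = run_pipeline(shards, [0, 0])            # hidden state accumulates 0+1+...+7
```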

Why does my first request take longer?

If the requested model isn't already loaded in a provider's memory, it needs to be loaded from disk (or downloaded first). This can take 10–60 seconds depending on model size. Subsequent requests to the same model are much faster since the model stays in memory. The coordinator routes to providers that already have the model loaded whenever possible.
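A warm-first routing preference like this can be sketched in a few lines. The provider records, model names, and tie-breaking rule below are hypothetical; the real coordinator's scoring is more involved.

```python
# Hypothetical provider registry entries.
providers = [
    {"id": "mac-a", "loaded_models": {"llama-3.2-3b"}, "free_ram_gb": 8},
    {"id": "mac-b", "loaded_models": set(),            "free_ram_gb": 32},
]

def pick_provider(providers, model, needed_ram_gb):
    """Prefer providers that already hold the model in memory (no cold load),
    then fall back to whoever has the most free RAM to load it."""
    eligible = [p for p in providers
                if model in p["loaded_models"] or p["free_ram_gb"] >= needed_ram_gb]
    eligible.sort(key=lambda p: (model not in p["loaded_models"],
                                 -p["free_ram_gb"]))
    return eligible[0]["id"] if eligible else None
```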

Can I limit how much of my Mac's resources are used?

Yes. The menubar app includes a RAM Allocation slider that lets you cap how much memory RAM Aggregator can use. Models that exceed your cap are automatically disabled. You can also enable Prevent Sleep to keep your Mac serving jobs while you're away, or disable it to let your Mac sleep normally when idle.

Does the app update automatically?

Yes, if auto-update is enabled (the default). The coordinator can push updates to all connected providers. When an update arrives, your daemon downloads the latest version, applies it, and restarts. You can disable auto-update in your config file if you prefer manual control.