
GPU Client Kernels

This guide explains how CUDA or ROCm kernels can submit tasks to the Chimaera runtime and receive results, using the same client API as host code. It also covers cross-device routing: CPU code sending tasks to run on the GPU, and GPU kernels sending tasks to run on the CPU.

Overview

The Chimaera runtime supports four GPU task routing modes:

| Routing Mode | Direction | Description |
| --- | --- | --- |
| PoolQuery::Local() | GPU → GPU | GPU kernel submits a task processed by the GPU work orchestrator |
| PoolQuery::LocalGpuBcast() | CPU → GPU | CPU submits a task processed by the GPU work orchestrator |
| PoolQuery::ToLocalGpu(gpu_id) | CPU → GPU | CPU submits a task to a specific GPU device |
| PoolQuery::ToLocalCpu() | GPU → CPU | GPU kernel submits a task processed by a CPU worker |

All paths use the same client.AsyncMyTask(query, ...) API. The IpcManager::Send() method detects the routing mode and dispatches to the correct transport automatically.
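The dispatch logic can be pictured as a switch on the routing mode. The sketch below is illustrative only — `RoutingMode` and `PickTransport` are hypothetical stand-ins, not Chimaera types — but it mirrors the routing table above: one entry point, transport chosen from the query and from whether the caller is device code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the PoolQuery routing modes; names are
// illustrative, not the actual Chimaera API.
enum class RoutingMode { kLocal, kLocalGpuBcast, kToLocalGpu, kToLocalCpu };

// Mirrors the dispatch described above: a single Send() entry point picks
// the transport from the routing mode and the compilation context.
inline std::string PickTransport(RoutingMode mode, bool on_gpu) {
  if (on_gpu) {
    // Device code: Local() goes to the GPU orchestrator's gpu2gpu queue,
    // ToLocalCpu() goes to the gpu2cpu queue polled by a CPU worker.
    return mode == RoutingMode::kToLocalCpu ? "SendGpu" : "SendGpuLocal";
  }
  switch (mode) {
    case RoutingMode::kLocalGpuBcast:
    case RoutingMode::kToLocalGpu:
      return "SendToGpu";  // push onto cpu2gpu_queue
    default:
      return "SendCpu";    // normal CPU routing (SHM/TCP/IPC)
  }
}
```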

Architecture

```text
CPU Host                                              GPU Work Orchestrator
  |                                                     |
  | AsyncGpuSubmit(LocalGpuBcast())                     |
  |  -> Send() detects LocalGpuBcast                    |
  |  -> SendToGpu(): serialize + push cpu2gpu_queue --->| pop + deserialize
  |                                                     | dispatch to GpuRuntime
  | Wait(): spin on FUTURE_COMPLETE                     | serialize output
  | <- deserialize output from FutureShm <--------------| threadfence_system + FUTURE_COMPLETE
  |                                                     |

GPU Kernel                                            CPU Worker
  |                                                     |
  | AsyncGpuSubmit(ToLocalCpu())                        |
  |  -> SendGpu(): serialize + push gpu2cpu_queue ----->| dequeue + deserialize
  |                                                     | execute task
  | Wait(): RecvGpu reads output ring buffer <----------| serialize output + FUTURE_COMPLETE
  |                                                     |

GPU Kernel                                            GPU Work Orchestrator
  |                                                     |
  | AsyncGpuSubmit(Local())                             |
  |  -> SendGpuLocal(): push gpu2gpu_queue ------------>| pop + deserialize
  |                                                     | dispatch to GpuRuntime
  | Wait(): RecvGpu reads output ring buffer <----------| serialize output + FUTURE_COMPLETE
```

Runtime vs. Client Process Execution Modes

The Chimaera runtime supports two process configurations for GPU task submission:

| Mode | Init Call | GPU Resources | Use Case |
| --- | --- | --- | --- |
| In-process runtime | CHIMAERA_INIT(kClient, true) | Full: all GPU queues, orchestrator, GPU memory backends | Single process with embedded server |
| Separate client process | CHIMAERA_INIT(kClient, false) | None: no GPU queues or orchestrator | Client connects to standalone server |

In-Process Runtime (Embedded Server)

When CHIMAERA_INIT(kClient, true) is called, the process runs both client and server:

```cpp
// One process does everything
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, true);

chimaera::my_module::Client client(pool_id);
client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id).Wait();

// All GPU routing modes work:
// Local(), LocalGpuBcast(), ToLocalGpu(), ToLocalCpu()
auto future = client.AsyncGpuSubmit(chi::PoolQuery::LocalGpuBcast(), 0, value);
future.Wait();
```

What gets initialized:

  1. ServerInit() runs in sequence:
    • IpcManager::ServerInit() → ServerInitGpuQueues() creates all GPU queue infrastructure
    • WorkOrchestrator::Init() + StartWorkers() launches worker threads
    • PoolManager::ServerInit() initializes pool management
    • LaunchGpuOrchestrator() launches persistent GPU kernel (deferred until after pools are composed)
    • StartLocalServer() opens the SHM segment for external clients
  2. ClientInit() initializes the client-side IPC connection

All four GPU routing modes are available because the process owns the GPU queues and orchestrator.

Separate Client Process

When CHIMAERA_INIT(kClient, false) is called, the process is a pure client. During the ClientConnect handshake, the server transmits GPU queue metadata (offsets, backend sizes, IPC handles) back to the client. If the client binary was compiled with CUDA/ROCm support, ClientInitGpuQueues() attaches to the server's GPU queue backends and reconstructs local queue pointers, enabling direct GPU task submission.

```cpp
// Server is a separate process
// Client connects via SHM, TCP, or IPC transport
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, false);

chimaera::my_module::Client client(pool_id);
client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id).Wait();

// PoolQuery::Local() always works (routed to server's CPU worker via SHM/TCP/IPC)
auto future = client.AsyncGpuSubmit(chi::PoolQuery::Local(), 0, value);
future.Wait();
// Server's CPU handler executes: result = test_value * 2 + gpu_id

// With GPU support compiled in, LocalGpuBcast also works (direct GPU queue push)
auto gpu_future = client.AsyncGpuSubmit(chi::PoolQuery::LocalGpuBcast(), 0, value);
gpu_future.Wait();
// Server's GPU handler executes: result = test_value * 2 + gpu_id
```

What gets initialized:

  1. ClientInit() only — no ServerInit():
    • ConfigManager::Init() reads chimaera.yaml
    • IpcManager::ClientInit() connects to the server via SHM, TCP, or IPC
    • Admin client singleton is created
  2. GPU queue attachment (if compiled with CUDA/ROCm):
    • ClientConnect response carries GPU queue offsets, backend sizes, and IPC handles
    • ClientInitGpuQueues() attaches to each GPU's GpuShmMmap queue backend via shm_attach()
    • cpu2gpu_queues_, gpu2gpu_queues_ are reconstructed from the received offsets
    • RegisterGpuAllocator() is called so the client can resolve ShmPtrs into queue memory
  3. No GPU orchestrator: gpu_orchestrator_ is nullptr (the server runs the orchestrator)

Available routing:

| Routing Mode | Works? | Why |
| --- | --- | --- |
| PoolQuery::Local() | Yes | Goes through SendShm() or SendZmq() to server's CPU worker |
| PoolQuery::LocalGpuBcast() | Yes (with GPU) | SendToGpu() pushes to attached cpu2gpu_queue → server's GPU orchestrator |
| PoolQuery::ToLocalGpu(id) | Yes (with GPU) | Same path, targeting a specific GPU device |
| PoolQuery::ToLocalCpu() | No | Only callable from GPU device code |

Without GPU support compiled in, LocalGpuBcast and ToLocalGpu return empty futures (no cpu2gpu_queues_ attached). Use PoolQuery::Local() as a fallback — the server's CPU handler executes the task instead.

ClientConnect GPU Queue Handshake

The ClientConnectTask response includes these GPU-specific fields:

| Field | Type | Description |
| --- | --- | --- |
| num_gpus_ | u32 | Number of GPU devices on the server |
| cpu2gpu_queue_off_[8] | u64[8] | Byte offsets of cpu2gpu_queue within each GPU's GpuShmMmap |
| gpu2cpu_queue_off_[8] | u64[8] | Byte offsets of gpu2cpu_queue (GPU→CPU) |
| gpu2gpu_queue_off_[8] | u64[8] | Byte offsets of gpu2gpu_queue |
| gpu_queue_backend_size_[8] | u64[8] | Size of each GPU's GpuShmMmap backend |
| gpu_queue_depth_ | u32 | Queue depth (number of slots) |

The client uses these offsets to reconstruct TaskQueue pointers. The cpu2gpu and gpu2cpu queues reside in the pinned host GpuShmMmap backend; the gpu2gpu queue resides in device memory. A single shm_attach() per GPU attaches the pinned backend; the gpu2gpu backend is attached separately via its IPC handle.
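The pointer reconstruction is ordinary base-plus-offset arithmetic: the server's offsets are valid in any process once that process has attached the same backend at its own local address. A minimal sketch, with `QueueHeader` and `AttachQueue` as hypothetical placeholders (not the real TaskQueue layout):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Placeholder for a queue header living inside an attached backend; the
// real TaskQueue layout is richer, this only illustrates the offset math.
struct QueueHeader {
  std::uint32_t depth;
  std::uint32_t lanes;
};

// Rebuild a local queue pointer from the client's attach address plus the
// byte offset received in the ClientConnect response.
inline QueueHeader* AttachQueue(void* backend_base, std::uint64_t queue_off,
                                std::uint64_t backend_size) {
  // The offset must land fully inside the attached backend.
  if (queue_off + sizeof(QueueHeader) > backend_size) return nullptr;
  return reinterpret_cast<QueueHeader*>(
      static_cast<char*>(backend_base) + queue_off);
}
```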

Cross-Process GPU Memory Registration

Clients can register GPU device memory backends with the server using the extended RegisterMemoryTask. This enables the server to resolve ShmPtrs that point into GPU memory allocated by a different process.

```cpp
// Client allocates a GpuMalloc backend
hipc::MemoryBackendId backend_id(static_cast<u32>(getpid()), 100);
hipc::GpuMalloc gpu_backend;
gpu_backend.shm_init(backend_id, data_size, "", gpu_id);

// Get IPC handle from the backend's private header
hipc::GpuMallocPrivateHeader priv_header;
GpuApi::Memcpy(&priv_header, gpu_backend.GetPrivateHeader(), sizeof(priv_header));

// Register with server via RegisterMemory(kGpuDeviceMemory)
auto reg_task = ipc->NewTask<admin::RegisterMemoryTask>(
    CreateTaskId(), kAdminPoolId, PoolQuery::Local(),
    backend_id,
    admin::MemoryType::kGpuDeviceMemory,
    gpu_id, data_size, &priv_header.ipc_handle_);
auto future = ipc->SendZmq(reg_task, IpcMode::kTcp);
future.Wait();
// Server opens the IPC handle and registers the allocator in gpu_alloc_map_
```

The MemoryType enum controls the registration path:

| MemoryType | Value | Backend | Registration Path |
| --- | --- | --- | --- |
| kCpuMemory | 0 | POSIX shared memory | Existing IpcManager::RegisterMemory() |
| kPinnedHostMemory | 1 | GpuShmMmap (pinned) | Attach via shm_attach() + RegisterGpuAllocator() |
| kGpuDeviceMemory | 2 | GpuMalloc (device) | Open IPC handle via shm_attach_ipc() + RegisterGpuAllocator() |

Fork-Based Test Pattern

For testing, a common pattern forks a child server and runs a pure client in the parent:

```cpp
int main(int argc, char* argv[]) {
  // Child process: server mode
  if (argc > 1 && std::string(argv[1]) == "--server-mode") {
    CHIMAERA_INIT(ChimaeraMode::kServer, true);
    sleep(300);  // Wait for parent to kill us
    return 0;
  }

  // Parent: fork server, then run as client
  pid_t server = fork();
  if (server == 0) {
    setpgid(0, 0);
    execl("/proc/self/exe", argv[0], "--server-mode", (char*)nullptr);
    _exit(1);
  }
  setpgid(server, server);
  WaitForServer();  // Poll for SHM segment file

  // Pure client
  setenv("CHI_WITH_RUNTIME", "0", 1);
  CHIMAERA_INIT(ChimaeraMode::kClient, false);

  // Submit tasks via Local() routing
  chimaera::my_module::Client client(pool_id);
  client.AsyncCreate(PoolQuery::Dynamic(), name, pool_id).Wait();
  auto future = client.AsyncGpuSubmit(PoolQuery::Local(), 0, 42);
  future.Wait();
  // result = 42 * 2 + 0 = 84 (CPU handler)

  // Cleanup
  int status = 0;
  kill(-server, SIGKILL);
  waitpid(server, &status, 0);
}
```

The execl("/proc/self/exe", ...) pattern avoids inheriting the parent's CHIMAERA_INIT static guard state.

See context-runtime/modules/MOD_NAME/test/test_gpu_client_process.cc for a complete implementation of this pattern with 5 test cases covering CPU routing, GPU queue attachment, GPU routing, and cross-process memory registration.

Memory Allocation Architecture

GPU Queue Memory (Server Only)

During ServerInitGpuQueues(), the server allocates pinned host memory backends for GPU communication. Each GPU device gets:

| Backend | ID | Type | Size | Purpose |
| --- | --- | --- | --- | --- |
| CPU→GPU queue backend | 1000 + gpu_id | GpuShmMmap (pinned) | 64 MB | Holds TaskQueue structures (cpu2gpu, gpu2cpu) and FutureShm allocations for CPU→GPU tasks |
| GPU→GPU queue backend | 3000 + gpu_id | Device memory | 32 MB | Holds gpu2gpu_queue and per-thread ArenaAllocator storage for GPU→GPU tasks |
| Orchestrator scratch | 2000 + gpu_id | GpuShmMmap (pinned) | 64 MB | Per-block ArenaAllocator storage for the persistent orchestrator kernel |
| GPU heap | 9000 + gpu_id | GpuMalloc (device) | 64 MB | Per-thread BuddyAllocator (CHI_GPU_HEAP_T) for serialization scratch buffers |

GpuShmMmap backends use cudaHostAlloc (pinned host memory) accessible from both CPU and GPU via UVA. GpuMalloc backends use cudaMalloc (device memory) for GPU-only allocations.

Server Process Memory Layout (per GPU):

```text
+----------------------------------------------------------------+
| GpuShmMmap (backend ID 1000+gpu_id, 64 MB pinned host memory)  |
| URL: /chi_gpu_queue_{gpu_id}                                   |
|                                                                |
|   +-----------+   +-----------+                                |
|   | TaskQueue |   | TaskQueue |                                |
|   | gpu2cpu_q |   | cpu2gpu_q |                                |
|   | (GPU->CPU)|   | (CPU->GPU)|                                |
|   +-----------+   +-----------+                                |
|                                                                |
|   +-----------------------------+                              |
|   | FutureShm allocations       |  (allocated by SendToGpu)    |
|   | [FutureShm + copy_space]    |                              |
|   +-----------------------------+                              |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| Device memory (backend ID 3000+gpu_id, 32 MB)                  |
| URL: /chi_gpu2gpu_queue_{gpu_id}                               |
|                                                                |
|   +-----------+                                                |
|   | TaskQueue |                                                |
|   | gpu2gpu_q |                                                |
|   | (GPU->GPU)|                                                |
|   +-----------+                                                |
|                                                                |
|   +--------------------------------------------+               |
|   | ArenaAllocator per-thread (GPU->GPU tasks) |               |
|   +--------------------------------------------+               |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| GpuShmMmap (backend ID 2000+gpu_id, 64 MB pinned host memory)  |
| URL: /chi_gpu_orchestrator_{gpu_id}                            |
|                                                                |
|   +--------------------------------------------+               |
|   | Block 0: ArenaAllocator (per-thread bump)  |               |
|   +--------------------------------------------+               |
|   | Block 1: ArenaAllocator                    |               |
|   +--------------------------------------------+               |
|   | ...                                        |               |
|   +--------------------------------------------+               |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| GpuMalloc (backend ID 9000+gpu_id, 64 MB device memory)        |
| URL: /chi_gpu_heap_{gpu_id}                                    |
|                                                                |
|   +--------------------------------------------+               |
|   | Block 0: BuddyAllocator (CHI_GPU_HEAP_T)   |               |
|   +--------------------------------------------+               |
|   | Block 1: BuddyAllocator                    |               |
|   +--------------------------------------------+               |
|   | ...                                        |               |
|   +--------------------------------------------+               |
+----------------------------------------------------------------+
```

Client Process Memory

A pure client (CHIMAERA_INIT(kClient, false)) does not create GPU backends, but attaches to the server's GPU queue backends during ClientInitGpuQueues() (if compiled with GPU support). Tasks can be sent through:

  • SHM mode: FutureShm allocated in the server's shared memory segment (chi_main_segment). The client writes serialized task data into FutureShm::copy_space via ShmTransport::Send().
  • TCP/IPC mode: FutureShm allocated via HSHM_MALLOC (process-local heap). Task data serialized and sent through ZMQ DEALER socket.
  • Direct GPU queue push (GPU-compiled clients): SendToGpu() serializes into the attached GpuShmMmap queue backend and pushes to cpu2gpu_queue. The server's GPU orchestrator dequeues and processes the task.

Client Process (GPU-compiled) Memory Layout:

```text
Attached from server (read/write via pinned host memory):
+----------------------------------------------------------------+
| GpuShmMmap (backend 1000+gpu_id) — attached via shm_attach()   |
| cpu2gpu_queue, gpu2gpu_queue pointers reconstructed            |
| FutureShm allocations for SendToGpu live here                  |
+----------------------------------------------------------------+

Optionally registered with server:
+----------------------------------------------------------------+
| GpuMalloc (client-owned, backend pid:unique_id)                |
| Device memory allocated via cudaMalloc                         |
| IPC handle sent to server via RegisterMemory(kGpuDeviceMemory) |
| Server opens handle to resolve ShmPtrs from this backend       |
+----------------------------------------------------------------+
```

GPU Kernel Memory (User-Allocated)

GPU kernels that use the client API require two memory backends passed in IpcManagerGpuInfo:

Primary backend (backend): GpuShmMmap pinned host memory (or GpuMalloc device memory for GPU→GPU). Provides per-thread ArenaAllocator (HSHM_DEFAULT_ALLOC_GPU_T) for fast bump-pointer allocation of FutureShm and task objects.

Heap backend (gpu_heap_backend): GpuMalloc device memory. Provides per-thread BuddyAllocator (CHI_GPU_HEAP_T) for serialization scratch buffers in LocalSaveTaskArchive / LocalLoadTaskArchive. Unlike the arena, the buddy allocator supports individual free(), so scratch memory is reclaimed after each task without exhausting the primary arena.

Use CHI_IPC->GetClientGpuInfo(gpu_id) to build the IpcManagerGpuInfo for same-process kernel launches — it fills all fields automatically:

// Same-process launch (Chimaera running in-process):
chi::IpcManagerGpuInfo gpu_info = CHI_IPC->GetClientGpuInfo(0);
// gpu_info.backend, .gpu2gpu_queue, .gpu_heap_backend all set

// Register primary backend so CPU can resolve GPU ShmPtrs
CHI_IPC->RegisterGpuAllocator(gpu_info.backend.id_,
gpu_info.backend.data_,
gpu_info.backend.data_capacity_);

The CHIMAERA_GPU_INIT macro partitions both backends per-thread:

```text
backend (ArenaAllocator, HSHM_DEFAULT_ALLOC_GPU_T):
+----------------------------------------------------+
| Pointer table (num_threads * sizeof(ptr))          |
+----------------------------------------------------+
| ArenaAllocator 0 (per_thread_size bytes)           |
+----------------------------------------------------+
| ArenaAllocator 1 (per_thread_size bytes)           |
+----------------------------------------------------+
| ...                                                |
+----------------------------------------------------+

gpu_heap_backend (BuddyAllocator, CHI_GPU_HEAP_T):
+----------------------------------------------------+
| Pointer table (num_threads * sizeof(ptr))          |
+----------------------------------------------------+
| BuddyAllocator 0 (per_thread_size bytes)           |
+----------------------------------------------------+
| BuddyAllocator 1 (per_thread_size bytes)           |
+----------------------------------------------------+
| ...                                                |
+----------------------------------------------------+
```

The ArenaAllocator is a bump-pointer allocator; it is reset (via alloc->Reset()) after each task completes — safe because only one task is in-flight per thread at a time. The BuddyAllocator reclaims memory individually as serialization vectors go out of scope. Access the per-thread heap via CHI_GPU_HEAP (expands to CHI_IPC->GetGpuHeap()).
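The bump-pointer discipline is simple enough to sketch in full. The `Arena` class below is an illustrative stand-in for the per-thread ArenaAllocator described above (not the hshm implementation): O(1) aligned allocation, no per-object free, and a whole-arena `Reset()` once the single in-flight task completes.

```cpp
#include <cassert>
#include <cstddef>

// Minimal bump-pointer arena; illustrative, not the hshm ArenaAllocator.
class Arena {
 public:
  Arena(void* buf, std::size_t capacity)
      : base_(static_cast<char*>(buf)), capacity_(capacity), off_(0) {}

  // Bump the offset to the next aligned position; no per-object metadata.
  void* Alloc(std::size_t size, std::size_t align = 8) {
    std::size_t start = (off_ + align - 1) & ~(align - 1);
    if (start + size > capacity_) return nullptr;  // arena exhausted
    off_ = start + size;
    return base_ + start;
  }

  // Safe because only one task is in flight per thread at a time.
  void Reset() { off_ = 0; }
  std::size_t Used() const { return off_; }

 private:
  char* base_;
  std::size_t capacity_;
  std::size_t off_;
};
```

The contrast with the buddy heap is exactly the one the text draws: the arena can only reclaim everything at once, so per-object serialization scratch belongs on the buddy side.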

GPU Initialization Sequence

Server GPU Init (Full Sequence)

The server initializes GPU resources in three phases during ServerInit():

Phase 1: Queue creation (IpcManager::ServerInitGpuQueues())

  1. Query hshm::GpuApi::GetDeviceCount() for number of GPU devices
  2. For each GPU, create a 64 MB GpuShmMmap pinned host backend (ID 1000+gpu_id) for cpu2gpu and gpu2cpu queues plus CPU→GPU FutureShm allocations
  3. Create a device-memory backend (ID 3000+gpu_id) for the gpu2gpu_queue and its per-thread ArenaAllocator
  4. Create two TaskQueue objects in the pinned backend:
    • gpu2cpu_queues_[gpu_id] — GPU→CPU queue (CPU worker polls this)
    • cpu2gpu_queues_[gpu_id] — CPU→GPU queue (orchestrator polls this)
  5. Create one TaskQueue in device memory:
    • gpu2gpu_queues_[gpu_id] — GPU→GPU queue (orchestrator polls this)
  6. Each queue has 1 lane, 2 priorities (normal + resumed), configurable depth
  7. Create orchestrator scratch backends (64 MB GpuShmMmap, IDs 2000+gpu_id)
  8. Create GPU heap backends (64 MB GpuMalloc, IDs 9000+gpu_id) for CHI_GPU_HEAP_T = BuddyAllocator
  9. Register all backends with RegisterGpuAllocator() for host-side ShmPtr resolution
  10. Populate gpu_orchestrator_info_ struct for GPU 0 including gpu_heap_backend

Phase 2: Pool composition (deferred)

  • All compose-section pools are created before the orchestrator launches
  • GPU containers are allocated during pool creation via autogenerated allocation kernels

Phase 3: Orchestrator launch (IpcManager::LaunchGpuOrchestrator())

  1. Read gpu_blocks and gpu_threads_per_block from configuration
  2. Set CUDA stack size to 131072 bytes (cudaDeviceSetLimit)
  3. Allocate WorkOrchestratorControl in pinned host memory
  4. Allocate gpu::PoolManager on device
  5. Create dedicated CUDA stream
  6. Launch persistent kernel: chimaera_gpu_orchestrator<<<blocks, threads_per_block, 0, stream>>>()
  7. Only block 0, thread 0 runs the gpu::Worker poll loop; other threads spin-wait for exit

Why deferred? The orchestrator occupies all SMs with persistent thread blocks. If launched before pool composition, cudaMalloc calls during GPU container allocation would deadlock against the persistent kernel.

Client GPU Init (None)

ClientInit() performs no GPU initialization:

  • No ServerInitGpuQueues() → cpu2gpu_queues_ remains empty
  • No LaunchGpuOrchestrator() → gpu_orchestrator_ remains nullptr
  • No GPU memory backends allocated

The client can only submit tasks through CPU transport (SHM, TCP, IPC).

Queue Assignment Algorithm

Worker Thread Partitioning

The DefaultScheduler::DivideWorkers() method partitions the configured worker threads into four roles:

Given N total worker threads (from num_threads in chimaera.yaml):

```text
Worker 0        -> Scheduler worker (handles task routing)
Workers 1..N-3  -> I/O workers (execute tasks)
Worker N-2      -> GPU worker (polls GPU->CPU lanes)
Worker N-1      -> Network worker (handles ZMQ send/recv)
```

Example with num_threads: 4:

  • Worker 0: scheduler
  • Worker 1: I/O worker (the only worker in the 1..N-3 range)
  • Worker 2: GPU worker (N-2)
  • Worker 3: network worker (N-1)

The GPU worker is assigned only when total_workers > 2. With 2 or fewer workers, there is no dedicated GPU worker and GPU→CPU tasks cannot be processed.
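The partitioning rule above is pure index arithmetic, so it can be captured in a few lines. `WorkerRole` is a hypothetical helper written for illustration, not the DivideWorkers() implementation:

```cpp
#include <cassert>
#include <string>

// Role assignment per the partitioning rule above: worker 0 schedules,
// workers 1..N-3 do I/O, worker N-2 polls GPU lanes (only when N > 2),
// and worker N-1 handles the network. Illustrative helper, not Chimaera code.
inline std::string WorkerRole(unsigned id, unsigned n) {
  if (id == 0) return "scheduler";
  if (n > 2 && id == n - 2) return "gpu";
  if (id == n - 1) return "network";
  return "io";
}
```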

GPU Lane Assignment

After ServerInitGpuQueues() creates the per-GPU TaskQueue objects, AssignGpuLanesToWorker() connects them to the GPU worker:

```cpp
void IpcManager::AssignGpuLanesToWorker() {
  Worker *gpu_worker = scheduler_->GetGpuWorker();
  std::vector<TaskLane*> gpu_lanes;

  // Collect lane 0 from each GPU's gpu_queue (GPU->CPU)
  for (size_t gpu_id = 0; gpu_id < num_gpus; ++gpu_id) {
    TaskLane *gpu_lane = &gpu_queues_[gpu_id]->GetLane(0, 0);
    gpu_lane->SetAssignedWorkerId(gpu_worker->GetId());
    gpu_lanes.push_back(gpu_lane);
  }

  gpu_worker->SetGpuLanes(gpu_lanes);
}
```

Result: The single GPU worker polls all GPU→CPU lanes across all GPUs. The GPU work orchestrator (persistent kernel) polls the CPU→GPU and GPU→GPU queues independently.

Queue Routing Summary

```text
                     Pure Client Process                Server Process
                    +------------------+             +------------------+
                    | No GPU queues    |             | cpu2gpu_queues_[]|----> GPU Orchestrator
                    | No orchestrator  |             | gpu2gpu_queues_[]|----> GPU Orchestrator
AsyncGpuSubmit() -->| SendShm() or     |--SHM/TCP/-->| gpu_queues_[]    |----> GPU Worker (CPU)
(PoolQuery::Local)  | SendZmq()        |     IPC     |                  |
                    +------------------+             +------------------+
                                                              |
                                                      CPU Workers execute
                                                      the task's CPU handler
```

Client-to-Server Transport Modes

The pure client's transport mode is determined by the CHI_IPC_MODE environment variable or chimaera.yaml configuration:

| Mode | Transport | How FutureShm is Allocated | How Task Data is Sent |
| --- | --- | --- | --- |
| SHM | Shared memory | In server's SHM segment | Written to FutureShm::copy_space ring buffer |
| TCP | ZMQ DEALER/ROUTER | Process-local heap (HSHM_MALLOC) | Serialized through ZMQ TCP socket |
| IPC | ZMQ DEALER/ROUTER (Unix socket) | Process-local heap (HSHM_MALLOC) | Serialized through ZMQ IPC socket |

For SHM mode, the server worker deserializes from the FutureShm::copy_space ring buffer using ShmTransport::Recv(). For TCP/IPC mode, the server receives the serialized data through the ZMQ ROUTER socket and deserializes on the network worker thread.

Prerequisites

CMake flags

Enable GPU support when configuring the build:

```bash
# CUDA
cmake .. -DWRP_CORE_ENABLE_CUDA=ON

# ROCm
cmake .. -DWRP_CORE_ENABLE_ROCM=ON
```

These set the HSHM_ENABLE_CUDA or HSHM_ENABLE_ROCM preprocessor macros, which gate all GPU code paths.

Required headers

```cpp
#include <chimaera/chimaera.h>
#include <chimaera/singletons.h>
#include <chimaera/task.h>
#include <chimaera/pool_query.h>
#include <hermes_shm/util/gpu_api.h>

// Your module's client and task headers
#include <chimaera/my_module/my_module_client.h>
#include <chimaera/my_module/my_module_tasks.h>
```

Writing a GPU-Compatible Task

A task that can be created and submitted from a GPU kernel must:

  1. Mark constructors and serialization methods with HSHM_CROSS_FUN
  2. Provide SerializeIn and SerializeOut methods
  3. Use only GPU-safe types (no std::string, std::vector, etc.)

```cpp
struct MyGpuTask : public chi::Task {
  IN chi::u32 input_value_;
  INOUT chi::u32 result_;

  HSHM_CROSS_FUN MyGpuTask() : chi::Task(), input_value_(0), result_(0) {}

  HSHM_CROSS_FUN explicit MyGpuTask(
      const chi::TaskId& task_id,
      const chi::PoolId& pool_id,
      const chi::PoolQuery& query,
      chi::u32 input_value)
      : chi::Task(task_id, pool_id, query, kMethodId),
        input_value_(input_value), result_(0) {
    task_id_ = task_id;
    pool_id_ = pool_id;
    method_ = kMethodId;
    task_flags_.Clear();
    pool_query_ = query;
  }

  // Serialization: fields sent GPU -> CPU
  template <typename Archive>
  HSHM_CROSS_FUN void SerializeIn(Archive& ar) {
    Task::SerializeIn(ar);
    ar(input_value_, result_);
  }

  // Serialization: fields sent CPU -> GPU (after execution)
  template <typename Archive>
  HSHM_CROSS_FUN void SerializeOut(Archive& ar) {
    Task::SerializeOut(ar);
    ar(result_);
  }
};
```

Important: Every function called during serialization must be HSHM_CROSS_FUN (i.e., __host__ __device__). Without this annotation, NVCC's SFINAE-based dispatch silently skips serialization on the GPU, producing corrupted data with no compiler error.
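The SerializeIn/SerializeOut split can be exercised on the host with a toy byte-stream archive. `OutAr`, `InAr`, and `GpuTaskFields` below are illustrative names written for this sketch, not hshm types; on the device the same pattern is compiled once the methods carry HSHM_CROSS_FUN.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy writing archive: operator()(...) appends each field's bytes in order.
struct OutAr {
  std::vector<std::uint8_t> buf;
  template <typename... Ts> void operator()(const Ts&... vs) { (write(vs), ...); }
  template <typename T> void write(const T& v) {
    auto* p = reinterpret_cast<const std::uint8_t*>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
  }
};

// Toy reading archive: operator()(...) consumes the same fields in order.
struct InAr {
  const std::uint8_t* p;
  template <typename... Ts> void operator()(Ts&... vs) { (read(vs), ...); }
  template <typename T> void read(T& v) {
    std::memcpy(&v, p, sizeof(T));
    p += sizeof(T);
  }
};

// Mirrors the split above: all fields go in on submission, only outputs
// come back after execution.
struct GpuTaskFields {
  std::uint32_t input = 0, result = 0;
  template <typename Ar> void SerializeIn(Ar& ar) { ar(input, result); }
  template <typename Ar> void SerializeOut(Ar& ar) { ar(result); }
};
```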

Writing a GPU-Compatible Client

The client class needs HSHM_CROSS_FUN on the constructor and any methods called from GPU code:

```cpp
class Client : public chi::ContainerClient {
 public:
  HSHM_CROSS_FUN Client() = default;
  HSHM_CROSS_FUN explicit Client(const chi::PoolId& pool_id) { Init(pool_id); }

  // Host-only methods (AsyncCreate, etc.) stay as-is...

  // GPU-callable task submission
  HSHM_CROSS_FUN
  chi::Future<MyGpuTask> AsyncMyTask(const chi::PoolQuery& query,
                                     chi::u32 input_value) {
    auto* ipc = CHI_IPC;
    auto task = ipc->NewTask<MyGpuTask>(
        chi::CreateTaskId(), pool_id_, query, input_value);
    return ipc->Send(task);
  }
};
```

CHI_IPC->Send() automatically dispatches based on compilation context and routing mode:

  • GPU device code (HSHM_IS_GPU): routes via SendGpu or SendGpuLocal
  • CPU host code with HSHM_ENABLE_CUDA=1: detects LocalGpuBcast/ToLocalGpu and routes via SendToGpu
  • CPU host code without GPU: routes through normal CPU routing (pool query → worker)

GPU Routing Modes

GPU → GPU (Local)

A GPU kernel submits a task that is processed by the GPU work orchestrator. The task runs on the GPU itself.

```cpp
// Inside a __global__ kernel
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::Local(), input);
future.Wait();
chi::u32 result = future->result_;
```

The GPU orchestrator must be running (not paused). The task is dispatched to your module's GpuRuntime::MyMethod() handler.

CPU → GPU (LocalGpuBcast)

CPU host code submits a task to be processed by the GPU work orchestrator. The task runs on the GPU.

```cpp
// Host code -- must be compiled with HSHM_ENABLE_CUDA=1
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
future.Wait();  // blocks until GPU completes + deserializes output
chi::u32 result = future->result_;
```

How it works:

  1. Send() detects LocalGpuBcast and calls SendToGpu() instead of normal CPU routing
  2. SendToGpu() allocates a FutureShm in GPU-accessible pinned host memory, serializes task input into its ring buffer, and pushes to the cpu2gpu_queue
  3. The GPU orchestrator pops the task, deserializes, dispatches to GpuRuntime, serializes output, and sets FUTURE_COMPLETE
  4. Wait() spins on FUTURE_COMPLETE, then deserializes output from the FutureShm ring buffer back into the task
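Steps 3 and 4 form a flag-based completion handshake: publish the output, then set the flag with release semantics; the waiter spins with acquire semantics before reading. The sketch below models it with std::atomic and a host thread standing in for the orchestrator; `FutureSlot` and `SubmitAndWait` are illustrative names, not Chimaera's FutureShm.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Models FUTURE_COMPLETE as an atomic flag set after the output is published.
struct FutureSlot {
  std::uint32_t output = 0;
  std::atomic<bool> complete{false};  // stands in for FUTURE_COMPLETE
};

inline std::uint32_t SubmitAndWait(std::uint32_t input) {
  FutureSlot slot;
  // "Orchestrator" side: compute, publish output, then set the flag.
  std::thread worker([&] {
    slot.output = input * 2;
    slot.complete.store(true, std::memory_order_release);
  });
  // "Wait()" side: spin until the flag is visible, then read the output.
  while (!slot.complete.load(std::memory_order_acquire)) {}
  worker.join();
  return slot.output;
}
```

On the GPU path the release side additionally needs __threadfence_system() so the pinned-memory write is visible across the PCIe boundary before the flag flips.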

Important -- HSHM_ENABLE_CUDA must be 1: The Send() GPU routing interceptor is inside #if HSHM_ENABLE_CUDA || HSHM_ENABLE_ROCM. If your source file is compiled with HSHM_ENABLE_CUDA=0 (e.g., a plain C++ file in a mixed CUDA/C++ target), the interceptor is compiled out and the task will go through normal CPU routing, which does not complete the future correctly for GPU tasks.

Solution: Call AsyncMyTask(LocalGpuBcast(), ...) from a source file compiled as CUDA (.cu or with LANGUAGE CUDA property), or from a wrapper function defined in such a file.

CPU → Specific GPU (ToLocalGpu)

Same as LocalGpuBcast but targets a specific GPU by device ID:

```cpp
// Send to GPU device 1
auto future = client.AsyncMyTask(chi::PoolQuery::ToLocalGpu(1), input);
future.Wait();
```

GPU → CPU (ToLocalCpu)

A GPU kernel submits a task to be processed by a CPU worker thread. The task runs on the CPU.

```cpp
// Inside a __global__ kernel
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::ToLocalCpu(), input);
future.Wait();
chi::u32 result = future->result_;
```

How it works:

  1. Send() on the GPU calls SendGpu(), which serializes the task and pushes to the gpu2cpu_queue
  2. The CPU GPU-worker thread polls this queue, deserializes, and routes to the module's CPU runtime handler
  3. The CPU worker serializes output into the FutureShm ring buffer and sets FUTURE_COMPLETE
  4. Wait() on the GPU calls RecvGpu() which reads from the output ring buffer

Host setup required: The GPU kernel needs a gpu2cpu_queue registered with the runtime. See Host-Side Setup for GPU → CPU below.

GPU Work Orchestrator

The GPU work orchestrator is a persistent CUDA kernel that processes tasks on the GPU. It runs on a dedicated CUDA stream and polls two queues:

  • cpu2gpu_queue -- tasks pushed by CPU via SendToGpu()
  • gpu2gpu_queue -- tasks pushed by GPU kernels via SendGpuLocal()

The runtime launches the orchestrator automatically during ServerInit() when GPU support is enabled.

Persistent Kernel Structure

The orchestrator is launched with configurable blocks and threads per block. Only block 0, thread 0 runs the worker poll loop; all other threads spin-wait for the exit signal:

```cpp
__global__ void chimaera_gpu_orchestrator(gpu::PoolManager *pool_mgr,
                                          gpu::WorkOrchestratorControl *control,
                                          IpcManagerGpuInfo gpu_info,
                                          u32 num_blocks) {
  CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks);

  if (blockIdx.x == 0 && threadIdx.x == 0) {
    control->running_flag = 1;
    gpu::Worker worker;
    worker.Init(0, gpu_info.cpu2gpu_queue, gpu_info.gpu2gpu_queue,
                pool_mgr, gpu_info.cpu2gpu_queue_base);

    while (!control->exit_flag) {
      worker.PollOnce();
    }
    worker.Finalize();
  }

  // Other blocks/threads: wait for exit signal
  if (blockIdx.x != 0 || threadIdx.x != 0) {
    while (!control->exit_flag) { /* spin */ }
  }
}
```

The extra blocks are launched to occupy all SMs, preventing other kernels from preempting the orchestrator.
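The control flow of the persistent loop can be reproduced on the host. In the sketch below, `Control` and `RunUntilExit` are illustrative stand-ins for `WorkOrchestratorControl` and `gpu::Worker`; to keep the sketch single-threaded, exit is modeled as a sentinel task, whereas the real orchestrator sees `control->exit_flag` flipped by the host (e.g., during PauseGpuOrchestrator()).

```cpp
#include <atomic>
#include <cassert>
#include <deque>

// Stand-in for WorkOrchestratorControl.
struct Control {
  std::atomic<bool> exit_flag{false};
};

// Persistent poll loop: drain the queue one task at a time until told
// to exit. A negative task value is the sentinel used by this sketch.
inline int RunUntilExit(Control& ctl, std::deque<int>& queue) {
  int processed = 0;
  while (!ctl.exit_flag.load()) {   // loop body mirrors worker.PollOnce()
    if (queue.empty()) continue;    // nothing to do this iteration
    int task = queue.front();
    queue.pop_front();
    if (task < 0) {
      ctl.exit_flag.store(true);    // sentinel stands in for host-set exit_flag
    } else {
      ++processed;                  // "PollOnce": handle one task
    }
  }
  return processed;
}
```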

GPU Container (GpuRuntime)

Each module that supports GPU execution defines a GpuRuntime class:

// my_module_gpu_runtime.h
class GpuRuntime : public chi::gpu::Container {
public:
HSHM_GPU_FUN GpuRuntime() = default;
HSHM_GPU_FUN ~GpuRuntime() override = default;

/** GPU handler for MyMethod */
HSHM_GPU_FUN void MyMethod(hipc::FullPtr<MyGpuTask> task,
chi::gpu::GpuRunContext &rctx) {
task->result_ = task->input_value_ * 3;
}

// Autogenerated virtual method overrides (switch-case dispatch)
#include "autogen/my_module_gpu_lib_exec.h"
};

Module YAML Configuration

Enable GPU support in your module's chimaera_mod.yaml:

```yaml
module_name: my_module
namespace: chimaera

# ... method IDs ...
kMyMethod: 25

# GPU support
has_gpu: true
gpu_methods:
  - kMyMethod
```

Run chimaera repo refresh . to regenerate the autogen dispatch code:

  • Per-module: autogen/my_module_gpu_lib_exec.h (virtual method overrides)
  • Repo-level: src/autogen/gpu_work_orchestrator_modules.h (container allocation + registry)

Pause/Resume

The GPU orchestrator occupies all SMs with persistent thread blocks. To launch other GPU kernels (e.g., test kernels), you must pause the orchestrator first:

```cpp
CHI_IPC->PauseGpuOrchestrator();   // signal exit + synchronize stream
// ... launch your kernel on a separate stream ...
CHI_IPC->ResumeGpuOrchestrator();  // relaunch persistent kernel
```

Important: Use stream-based synchronization (hshm::GpuApi::Synchronize(stream)) for your own kernels, not cudaDeviceSynchronize(). The latter would block on the orchestrator's stream.

Container Registration

GPU containers are automatically allocated and registered when a pool with has_gpu: true is created. The allocation kernel runs in the orchestrator's CUDA module context so vtables are correct for virtual dispatch.

GPU Kernel Implementation

Initialization Macros

Three macros are available depending on the kernel's role:

CHIMAERA_GPU_INIT(gpu_info)

For regular client kernels (not the orchestrator). Initializes the per-block IpcManager from a fully-populated IpcManagerGpuInfo. All threads in a block share one IpcManager instance (in __shared__ memory); thread 0 runs ClientInitGpu, then __syncthreads() makes it visible to all threads.

```cpp
__global__ void my_kernel(chi::IpcManagerGpuInfo gpu_info,
                          chi::PoolId pool_id,
                          chi::u32 input,
                          chi::u32* d_output) {
  CHIMAERA_GPU_INIT(gpu_info);

  chimaera::my_module::Client client(pool_id);
  auto future = client.AsyncMyTask(chi::PoolQuery::Local(), input);
  future.Wait();
  *d_output = future->result_;
}
```

CHI_CLIENT_GPU_INIT(gpu_info)

Alias for CHIMAERA_GPU_INIT. Use this in client-process kernels for clarity:

```cpp
__global__ void client_kernel(chi::IpcManagerGpuInfo gpu_info, ...) {
  CHI_CLIENT_GPU_INIT(gpu_info);
  // identical behavior
}
```

CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks)

For kernels that partition their memory across multiple blocks (e.g., benchmark client kernels and the runtime orchestrator itself). Before calling ClientInitGpu, this macro splits all three backends (backend, gpu2cpu_backend, gpu_heap_backend) so each block gets its own non-overlapping slice:

per_block = data_capacity / num_blocks
block N slice starts at: data_ + N * per_block

Use this when launching multiple blocks where each block submits tasks independently:

__global__ void bench_kernel(chi::IpcManagerGpuInfo gpu_info,
                             chi::PoolId pool_id,
                             chi::u32 num_blocks,
                             chi::u32 total_tasks,
                             int* d_done) {
  CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks);

  if (threadIdx.x != 0) return;  // only thread 0 per block submits

  chimaera::my_module::Client client(pool_id);
  for (chi::u32 i = 0; i < total_tasks; ++i) {
    auto future = client.AsyncMyTask(chi::PoolQuery::Local(), i);
    future.Wait();
  }

  if (blockIdx.x == 0) {
    __threadfence_system();
    *d_done = 1;
  }
}

Host-Side Setup

Host-Side Setup for GPU → GPU

For PoolQuery::Local() from a GPU kernel, the task goes through the gpu2gpu_queue which is managed by the orchestrator. The simplest approach is GetClientGpuInfo(), which fills all IpcManagerGpuInfo fields automatically from the running runtime:

// Build IpcManagerGpuInfo — fills backend, gpu2gpu_queue, gpu_heap_backend, etc.
chi::IpcManagerGpuInfo gpu_info = CHI_IPC->GetClientGpuInfo(0);

// Register primary backend so CPU can resolve GPU ShmPtrs
CHI_IPC->RegisterGpuAllocator(gpu_info.backend.id_,
                              gpu_info.backend.data_,
                              gpu_info.backend.data_capacity_);

For custom backends (e.g., benchmark kernels with a dedicated allocation region):

// Custom primary backend (pinned host, for ArenaAllocator)
hipc::MemoryBackendId backend_id(100, 0);
hipc::GpuShmMmap gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/gpu_test", 0);
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
                              gpu_backend.data_capacity_);

// Custom heap backend (device memory, for BuddyAllocator / CHI_GPU_HEAP_T)
hipc::MemoryBackendId heap_id(101, 0);
hipc::GpuMalloc gpu_heap;
gpu_heap.shm_init(heap_id, 4 * 1024 * 1024, "/gpu_heap", 0);

chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = gpu_backend;
gpu_info.gpu_heap_backend = gpu_heap;
gpu_info.gpu2cpu_queue = nullptr;  // not using GPU->CPU
gpu_info.cpu2gpu_queue = nullptr;  // not receiving CPU->GPU
gpu_info.gpu2gpu_queue = CHI_IPC->GetGpuToGpuQueue(0);

Use pinned host memory for result polling (not cudaDeviceSynchronize, which hangs with the persistent orchestrator):

int *d_result;
cudaMallocHost(&d_result, sizeof(int));
*d_result = 0;

// Pause orchestrator, launch kernel, resume orchestrator
CHI_IPC->PauseGpuOrchestrator();
my_kernel<<<1, 1, 0, stream>>>(gpu_info, pool_id, input, d_result);
CHI_IPC->ResumeGpuOrchestrator();

// Poll pinned memory for completion
while (*d_result == 0) {
  std::this_thread::sleep_for(std::chrono::microseconds(100));
}

Host-Side Setup for GPU → CPU

For PoolQuery::ToLocalCpu() from a GPU kernel, the task goes through the gpu2cpu_queue that a CPU worker thread polls. The host must create and register this queue, and also provide a pinned-host gpu2cpu_backend so the GPU kernel can allocate FutureShm objects:

// 1. Primary backend (device memory, for ArenaAllocator)
hipc::MemoryBackendId backend_id(5, 0);
hipc::GpuMalloc gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/gpu_to_cpu", 0);

// 2. GPU->CPU copy-space backend (pinned host, for FutureShm allocation)
hipc::MemoryBackendId copy_backend_id(6, 0);
hipc::GpuShmMmap copy_backend;
copy_backend.shm_init(copy_backend_id, 4 * 1024 * 1024, "/gpu_to_cpu_copy", 0);
CHI_IPC->RegisterGpuAllocator(copy_backend_id, copy_backend.data_,
                              copy_backend.data_capacity_);

// 3. GPU task queue (pinned shared memory)
hipc::MemoryBackendId queue_backend_id(7, 0);
hipc::GpuShmMmap queue_backend;
queue_backend.shm_init(queue_backend_id, 2 * 1024 * 1024, "/gpu_to_cpu_q", 0);

auto *queue_alloc = queue_backend.MakeAlloc<hipc::ArenaAllocator<false>>(
    queue_backend.data_capacity_);
auto gpu_queue_ptr = queue_alloc->NewObj<chi::TaskQueue>(
    queue_alloc, 1, 2, 1024);

// 4. Register queue and assign to CPU GPU-worker
CHI_IPC->RegisterGpuQueue(gpu_queue_ptr);
CHI_IPC->AssignGpuLanesToWorker();
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
                              gpu_backend.data_capacity_);

// 5. Build IpcManagerGpuInfo and launch kernel
chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = gpu_backend;
gpu_info.gpu2cpu_queue = gpu_queue_ptr.ptr_;
gpu_info.gpu2cpu_backend = copy_backend;

CPU → GPU (No Extra Host Setup)

For PoolQuery::LocalGpuBcast() from CPU host code, no additional host setup is needed beyond creating the pool. The runtime creates the cpu2gpu_queue and GPU orchestrator during ServerInit().

// Compile this in a CUDA source file (HSHM_ENABLE_CUDA=1)
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, true);

chimaera::my_module::Client client(pool_id);
auto create = client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id);
create.Wait();

// Submit task to GPU -- Send() detects LocalGpuBcast -> SendToGpu()
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
future.Wait(); // spins on FUTURE_COMPLETE, deserializes GPU output
chi::u32 result = future->result_;

Key Registration Functions

FunctionPurpose
CHI_IPC->GetClientGpuInfo(gpu_id)Build a fully-populated IpcManagerGpuInfo for same-process kernel launches
CHI_IPC->RegisterGpuQueue(queue)Registers a GPU task queue so the runtime can poll it
CHI_IPC->AssignGpuLanesToWorker()Assigns all registered GPU queue lanes to the designated GPU worker thread
CHI_IPC->RegisterGpuAllocator(id, data, capacity)Registers GPU backend memory so the host can resolve ShmPtrs allocated by the GPU kernel
CHI_IPC->PauseGpuOrchestrator()Pauses the persistent GPU orchestrator to free SMs
CHI_IPC->ResumeGpuOrchestrator()Resumes the GPU orchestrator
CHI_IPC->GetGpuToGpuQueue(gpu_id)Returns the orchestrator's GPU→GPU queue pointer for a given device
CHI_GPU_HEAPPer-thread BuddyAllocator (CHI_GPU_HEAP_T) from the GPU heap table; valid in device code after CHIMAERA_GPU_INIT

Configuration Parameters

GPU Stack Size

GPU kernels that use the client API involve deep template instantiation chains (serialization, ring buffer, allocator). The default CUDA stack size (1024 bytes) is insufficient.

cudaDeviceSetLimit(cudaLimitStackSize, 131072);  // 128 KB recommended

Symptoms of stack overflow: silent kernel completion with corrupted or missing data (no error reported by cudaGetLastError).

Ring Buffer Size (Copy Space)

The SPSC ring buffer size determines how much serialized data can be in-flight between GPU and CPU. This is set per-task:

// Inside your task class, override GetCopySpaceSize:
HSHM_CROSS_FUN size_t GetCopySpaceSize() const { return 8192; }

The default is 4096 bytes if GetCopySpaceSize() returns 0. The FutureShm is allocated as sizeof(FutureShm) + copy_space_size, with separate input_ and output_ ring buffers sharing the same copy space.

For tasks with large serialized payloads, increase this value. If the ring buffer is too small, the producer will spin-wait for the consumer to drain data.

GPU Memory Backend Sizes

Two backends must be sized independently:

Primary backend (backend, ArenaAllocator): holds FutureShm + task objects. Reset after each task so size only needs to cover one task at a time:

// Per-block: sizeof(FutureShm) + copy_space + sizeof(TaskT) + overhead
// 10 MB per block is sufficient for typical tasks
size_t backend_size = num_blocks * 10 * 1024 * 1024;

Heap backend (gpu_heap_backend, BuddyAllocator): holds serialization scratch vectors. Individual free() reclaims memory, so size covers peak concurrent scratch usage (~2× copy_space per task):

// Per-block: ~2 * copy_space_size + BuddyAllocator overhead
// 4 MB per block is sufficient for typical tasks
size_t heap_backend_size = num_blocks * 4 * 1024 * 1024;

A single task submission typically consumes from the primary arena:

  • sizeof(FutureShm) + copy_space_size for the future
  • sizeof(TaskT) for the deserialized task object (orchestrator side)

And from the heap:

  • Input serialization vector (~copy_space_size)
  • Output serialization vector (~copy_space_size)

GPU Task Queue Configuration

// Parameters: allocator, num_lanes, num_priorities, depth
queue_alloc->NewObj<chi::TaskQueue>(queue_alloc, 1, 2, 1024);
ParameterDefaultDescription
num_lanes1Number of queue lanes (1 per GPU is typical)
num_priorities2Priority levels per lane
depth1024Maximum tasks per lane before blocking

Queue Backend Size

The queue backend holds the TaskQueue data structure and its internal lane storage:

size_t queue_memory_size = 2 * 1024 * 1024;  // 2 MB is sufficient for most use cases

Runtime Thread Configuration

The Chimaera runtime automatically detects GPU devices and creates GPU queues during ServerInit. The default scheduler (DefaultScheduler) partitions workers as follows:

# chimaera.yaml
num_threads: 4

With num_threads: N:

Worker IDRoleResponsibility
0SchedulerTask routing and scheduling
1 to N-3I/O workersExecute tasks from CPU queues
N-2GPU workerPolls GPU→CPU queue lanes
N-1Network workerHandles ZMQ send/recv

The GPU worker polls all GPU→CPU lanes across all GPU devices. With num_threads: 4, the partition yields 1 scheduler worker, 1 I/O worker, 1 GPU worker, and 1 network worker.

CMake Integration

Mixed CUDA/C++ Targets

When building a test or application that has both CUDA and C++ source files, keep GPU-routing calls in the CUDA source file. The typical CMake pattern:

# GPU kernels compiled as CUDA
set_source_files_properties(my_gpu_kernels.cc PROPERTIES LANGUAGE CUDA)
add_library(my_gpu_kernels OBJECT my_gpu_kernels.cc)
target_link_libraries(my_gpu_kernels PRIVATE hshm::cuda_cxx)

# Main executable with CPU sources
add_executable(my_app my_main.cc $<TARGET_OBJECTS:my_gpu_kernels>)

# CRITICAL: CPU sources are compiled with HSHM_ENABLE_CUDA=0 to avoid
# __device__ errors. GPU routing in Send() is compiled out.
set_source_files_properties(my_main.cc PROPERTIES
  COMPILE_OPTIONS "-UHSHM_ENABLE_CUDA;-DHSHM_ENABLE_CUDA=0"
)

Consequence: client.AsyncMyTask(LocalGpuBcast(), ...) called from my_main.cc will NOT route to the GPU because the Send() interceptor is compiled out. Instead, define a wrapper in the CUDA source file:

// my_gpu_kernels.cc (compiled as CUDA, HSHM_ENABLE_CUDA=1)
extern "C" int run_cpu_to_gpu(chi::PoolId pool_id, chi::u32 input,
                              chi::u32 *out_result) {
  chimaera::my_module::Client client(pool_id);
  auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
  if (!future.Wait(10.0f)) return -3;  // timeout
  *out_result = future->result_;
  return 1;
}

GPU Module Include Directories

GPU targets need $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}> for autogen headers:

target_include_directories(my_gpu_target PRIVATE
  $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
)

Data Flow Details

Send Path (GPU → CPU)

  1. NewTask: Allocates the task from the per-thread ArenaAllocator
  2. SendGpu: Allocates a FutureShm with embedded ring buffer copy space
  3. Enqueue: Pushes the Future onto the GPU task queue lane (the CPU worker polls this)
  4. Serialize + Send: Builds a LocalSaveTaskArchive, calls task->SerializeIn(ar), then streams the archive through ShmTransport::Send into the input ring buffer

Receive Path (CPU → GPU)

  1. CPU worker dequeues the FutureShm, reads from the input ring buffer via ShmTransport::Recv, deserializes, executes the task, then writes output back through the output ring buffer
  2. RecvGpu: On the GPU, Future::Wait() calls RecvGpu which reads from the output ring buffer via ShmTransport::Recv
  3. Deserialize: A LocalLoadTaskArchive deserializes the output fields via task->SerializeOut(ar)
  4. Completion: The GPU spins on FUTURE_COMPLETE flag (should already be set after Recv completes)

CPU → GPU Send Path (SendToGpu)

  1. Send() detects LocalGpuBcast/ToLocalGpu routing and calls SendToGpu()
  2. SendToGpu allocates a FutureShm in GPU-accessible pinned host memory via AllocateGpuBuffer()
  3. Serializes task input via task->SerializeIn(ar) + ShmTransport::Send into the input ring buffer
  4. Pushes a Future<Task> onto cpu2gpu_queues_[gpu_id]
  5. Returns a Future<TaskT> with the pinned-memory FutureShm

CPU → GPU Receive Path (Wait runtime path)

  1. Wait() enters the runtime spin-wait loop on FUTURE_COMPLETE
  2. The GPU orchestrator pops the task, dispatches to GpuRuntime, serializes output into the FutureShm's output ring buffer, calls __threadfence_system(), and sets FUTURE_COMPLETE
  3. Wait() detects FUTURE_COPY_FROM_CLIENT flag and non-zero output_.copy_space_size_, then deserializes output via ShmTransport::Recv + task->SerializeOut(ar)

Client Process → Server CPU Path (SendShm/SendZmq)

When a pure client uses PoolQuery::Local(), the task bypasses all GPU queues and goes through the standard client transport:

  1. SendShm: Allocates FutureShm in server's SHM segment with copy_space ring buffer. Enqueues to worker_queues_ lane (mapped by ClientMapTask). Serializes task via SaveTaskArchive + ShmTransport::Send into the ring buffer.
  2. Server worker: Dequeues FutureShm, checks FUTURE_COPY_FROM_CLIENT flag, reads from copy_space ring buffer via ShmTransport::Recv, deserializes, executes the CPU handler (e.g., result = test_value * 2 + gpu_id).
  3. Response: Worker serializes output back through the ring buffer, sets FUTURE_COMPLETE.
  4. Client Wait(): Spins on FUTURE_COMPLETE, deserializes output from copy_space.

This path uses the CPU runtime handler, not the GPU handler. The computation differs:

  • CPU handler: result_value = test_value * 2 + gpu_id
  • GPU handler: result_value = test_value * 3 + gpu_id

Wire Format

GPU and CPU use compatible but different archive types:

SideSerializationDeserialization
GPU (send)LocalSaveTaskArchiveLocalLoadTaskArchive
CPU (worker)SaveTaskArchiveLoadTaskArchive

Both produce the same binary wire format: [vector_size][elements...][msg_type][stream_data]. This is critical -- if the formats diverge, deserialization will produce corrupted data.

Client Process GPU Tests

The test file context-runtime/modules/MOD_NAME/test/test_gpu_client_process.cc verifies cross-process GPU task submission using the fork-based test pattern. A child process runs as a standalone Chimaera server (with --server-mode), while the parent connects as a pure client.

Test Infrastructure

All tests share the same server/client lifecycle managed by EnsureInitialized():

  1. StartServerProcess(): Forks a child, sets CHI_WITH_RUNTIME=1, and execls back into the same binary with --server-mode. The child calls CHIMAERA_INIT(kServer, true) and sleeps until killed.
  2. WaitForServer(): Polls for the SHM segment file (/tmp/chimaera_<user>/chi_main_segment_<user>) up to 50 times at 200ms intervals.
  3. Parent client init: Sets CHI_WITH_RUNTIME=0 and calls CHIMAERA_INIT(kClient, false). During ClientConnect, the client receives GPU queue metadata from the server.
  4. Cleanup: CleanupServer() sends SIGKILL to the server process group and removes SHM files.

Test Cases

client_process_gpu_submit_local

Submits a GpuSubmitTask with PoolQuery::Local() routing from the client process. The task travels via SHM or TCP transport to the server's CPU worker, which executes the CPU handler (result = test_value * 2 + gpu_id). Verifies the result matches the expected CPU computation.

This test works regardless of GPU support — Local() routing always goes through the CPU worker.

client_process_gpu_submit_multiple

Submits 5 sequential GpuSubmitTasks with different test_value parameters (100–104), each via PoolQuery::Local(). Verifies that all return correct results, confirming the transport handles repeated task round-trips without state corruption.

client_process_gpu_queue_attachment

Verifies that ClientInitGpuQueues() successfully attached to the server's GPU queues during ClientConnect. Checks that GetToGpuQueueCount() > 0 and that GetToGpuQueue(0) and GetGpuToGpuQueue(0) return non-null pointers.

Requires HSHM_ENABLE_CUDA or HSHM_ENABLE_ROCM at compile time. Without GPU support, prints "GPU support not compiled, skipping" and passes.

client_process_gpu_submit_to_gpu

Submits a GpuSubmitTask with PoolQuery::LocalGpuBcast() routing. This exercises the SendToGpu() path: the client serializes the task into the attached GpuShmMmap queue backend and pushes it onto cpu2gpu_queue. The server's GPU orchestrator dequeues and dispatches to the module's GPU handler.

Requires GPU support compiled in and a GPU device on the server. Verifies that result = test_value * 2 + gpu_id (GPU handler uses the same formula in this test module).

client_process_register_gpu_memory

Tests cross-process GPU memory registration. The client:

  1. Allocates a 4 MB GpuMalloc backend (device memory via cudaMalloc)
  2. Reads the IPC handle from the backend's private header (GPU→host memcpy)
  3. Sends a RegisterMemoryTask with MemoryType::kGpuDeviceMemory to the server
  4. The server opens the IPC handle via GpuMalloc::shm_attach_ipc() and registers the allocator in gpu_alloc_map_

After registration, the server can resolve ShmPtrs pointing into this client-owned device memory.

Building and Running

cd /workspace/build
cmake /workspace && cmake --build . -j$(nproc)

# CRITICAL: install ALL .so files (stale modules cause serialization mismatches)
sudo cp bin/lib*.so /usr/local/lib/ && sudo ldconfig

# Run all client process GPU tests
./bin/chimaera_gpu_client_process_tests

# Run a specific test
./bin/chimaera_gpu_client_process_tests client_process_gpu_submit_local

Note: Tests 3–5 (gpu_queue_attachment, gpu_submit_to_gpu, register_gpu_memory) require the test binary to be compiled with CUDA/ROCm support. Without it, they skip gracefully. Tests 1–2 work with any build configuration.

Performance

GPU→GPU Task Round-Trip Latency

The following latency was measured using the in-process benchmark (bench_gpu_runtime) with 1 client block × 1 active thread (sequential AsyncGpuSubmit + Wait() loop):

Allocator (CHI_GPU_HEAP_T)Avg LatencyThroughput (1 thread)Notes
BuddyAllocator (default)~200 µs~5,000 tasks/secManaged pool; per-allocation free; no exhaustion
MallocAllocator (device heap)~400 µs~2,500 tasks/secCUDA device malloc()/free(); serialized across warps

BuddyAllocator backed by GpuMalloc device memory is the default because it is 2× faster than CUDA device heap malloc() and does not exhaust memory across arbitrarily many tasks.

How Latency Is Measured

The benchmark timer starts on the CPU immediately after ResumeGpuOrchestrator() and ends when the CPU polls *d_done == 1 (set by the GPU after the last task completes). The formula is:

avg_latency = elapsed_ms * 1000 / total_tasks   (µs per task)

This captures the full GPU→GPU round-trip per task:

  1. AsyncGpuSubmit: serialize inputs into FutureShm ring buffer, push to gpu2gpu_queue
  2. Orchestrator polls, dispatches DispatchTask: deserialize → Run() → serialize output
  3. FUTURE_COMPLETE written with __threadfence_system() + system-scope atomic
  4. future.Wait() spin exits, RecvGpu deserializes output

Note: the first task's queue-push overhead is excluded because the GPU kernel starts before the CPU timer. CPU polling overhead (100 µs sleep per iteration) adds at most ~0.1 µs per task across 1,000 tasks and is negligible.

Memory Limits

The default GpuShmMmap orchestrator scratch (HSHM_DEFAULT_ALLOC_GPU_T = ArenaAllocator) is reset after each completed task (alloc->Reset()), so it never exhausts regardless of task count. The BuddyAllocator heap (CHI_GPU_HEAP_T) reclaims memory individually as serialization vectors go out of scope.

Troubleshooting

CUDA Error 700 (Illegal Memory Access)

Usually caused by GPU stack overflow during deep serialization template chains. Increase the stack size:

cudaDeviceSetLimit(cudaLimitStackSize, 131072);

CUDA Error 715 (Illegal Instruction)

Caused by using system-scope atomics on pinned host memory, or by cross-library virtual function calls. Solutions:

  • Use device-scope atomics (atomicAdd, atomicOr) with explicit __threadfence_system() at strategic points
  • Ensure all GPU containers are allocated within the orchestrator's CUDA module context (the autogen system handles this)

Silent data corruption (kernel completes but wrong values)

Check that all serialization functions are marked HSHM_CROSS_FUN. Without __device__ annotation, NVCC's SFINAE dispatch silently skips serialization, consuming 0 bytes and misaligning the stream. The compiler produces no error.

Ring buffer deadlock (kernel hangs)

The GPU producer and CPU consumer share the ring buffer. If the buffer fills, the producer spins waiting for the consumer. Ensure:

  • The CPU worker is running and polling the GPU queue lanes
  • RegisterGpuQueue and AssignGpuLanesToWorker were called before the kernel launch
  • The copy space is large enough for the serialized payload

CPU→GPU Wait() timeout

If future.Wait() times out after SendToGpu():

  • Verify the GPU orchestrator is running (not paused)
  • Verify the module's GPU container is registered (check for "Registered GPU container" log message)
  • Verify the call site is compiled with HSHM_ENABLE_CUDA=1 -- if compiled with HSHM_ENABLE_CUDA=0, the Send() GPU interceptor is absent and the task goes through normal CPU routing, which cannot complete a GPU-bound future

Client process GPU routing fails

If a pure client process (CHIMAERA_INIT(kClient, false)) tries to use LocalGpuBcast or ToLocalGpu and gets empty futures:

  • Binary not compiled with GPU support: ClientInitGpuQueues() is skipped, cpu2gpu_queues_ is empty. Use PoolQuery::Local() as a fallback (task goes to the server's CPU handler).
  • Stale installed libraries: After rebuilding with GPU queue extensions, all .so files must be reinstalled — not just libchimaera_cxx.so. Stale module .so files (e.g., libchimaera_admin_runtime.so) cause ClientConnect serialization mismatches because the server loads modules from /usr/local/lib/. Fix: sudo cp /workspace/build/bin/lib*.so /usr/local/lib/ && sudo ldconfig
  • Server has no GPUs: ClientConnect returns num_gpus_ == 0, so no queues are attached. This is expected behavior.

Kernel launch failure (-201)

Check cudaGetLastError() after the kernel launch. Common causes:

  • Too many threads per block (ArenaAllocator table won't fit in memory)
  • Missing CHIMAERA_GPU_INIT (uninitialized shared memory access)

ShmPtr resolution failure on host

The CPU worker must be able to resolve ShmPtrs that point into GPU-allocated pinned memory. Call RegisterGpuAllocator before launching the kernel:

CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
                              gpu_backend.data_capacity_);

GPU orchestrator SM starvation

The persistent orchestrator occupies all SMs (32 blocks by default). To launch other GPU kernels:

  1. Call CHI_IPC->PauseGpuOrchestrator()
  2. Launch your kernel on a separate stream
  3. Synchronize your stream (not cudaDeviceSynchronize)
  4. Call CHI_IPC->ResumeGpuOrchestrator()