GPU Client Kernels
This guide explains how CUDA or ROCm kernels can submit tasks to the Chimaera runtime and receive results, using the same client API as host code. It also covers cross-device routing: CPU code sending tasks to run on the GPU, and GPU kernels sending tasks to run on the CPU.
Overview
The Chimaera runtime supports four GPU task routing modes:
| Routing Mode | Direction | Description |
|---|---|---|
| PoolQuery::Local() | GPU → GPU | GPU kernel submits a task processed by the GPU work orchestrator |
| PoolQuery::LocalGpuBcast() | CPU → GPU | CPU submits a task processed by the GPU work orchestrator |
| PoolQuery::ToLocalGpu(gpu_id) | CPU → GPU | CPU submits a task to a specific GPU device |
| PoolQuery::ToLocalCpu() | GPU → CPU | GPU kernel submits a task processed by a CPU worker |
All paths use the same client.AsyncMyTask(query, ...) API. The IpcManager::Send() method detects the routing mode and dispatches to the correct transport automatically.
Architecture
CPU Host GPU Work Orchestrator
| |
| AsyncGpuSubmit(LocalGpuBcast()) |
| -> Send() detects LocalGpuBcast |
| -> SendToGpu(): serialize + push cpu2gpu_queue ---> | pop + deserialize
| | dispatch to GpuRuntime
| Wait(): spin on FUTURE_COMPLETE | serialize output
| <- deserialize output from FutureShm <--- | threadfence_system + FUTURE_COMPLETE
| |
GPU Kernel CPU Worker
| |
| AsyncGpuSubmit(ToLocalCpu()) |
| -> SendGpu(): serialize + push gpu2cpu_queue ---> | dequeue + deserialize
| | execute task
| Wait(): RecvGpu reads output ring buffer <--- | serialize output + FUTURE_COMPLETE
| |
GPU Kernel GPU Work Orchestrator
| |
| AsyncGpuSubmit(Local()) |
| -> SendGpuLocal(): push gpu2gpu_queue ---> | pop + deserialize
| | dispatch to GpuRuntime
| Wait(): RecvGpu reads output ring buffer <--- | serialize output + FUTURE_COMPLETE
Runtime vs. Client Process Execution Modes
The Chimaera runtime supports two process configurations for GPU task submission:
| Mode | Init Call | GPU Resources | Use Case |
|---|---|---|---|
| In-process runtime | CHIMAERA_INIT(kClient, true) | Full: all GPU queues, orchestrator, GPU memory backends | Single process with embedded server |
| Separate client process | CHIMAERA_INIT(kClient, false) | None: no GPU queues or orchestrator | Client connects to standalone server |
In-Process Runtime (Embedded Server)
When CHIMAERA_INIT(kClient, true) is called, the process runs both client and server:
// One process does everything
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, true);
chimaera::my_module::Client client(pool_id);
client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id).Wait();
// All GPU routing modes work:
// Local(), LocalGpuBcast(), ToLocalGpu(), ToLocalCpu()
auto future = client.AsyncGpuSubmit(chi::PoolQuery::LocalGpuBcast(), 0, value);
future.Wait();
What gets initialized:
- ServerInit() runs in sequence:
  - IpcManager::ServerInit() → ServerInitGpuQueues() creates all GPU queue infrastructure
  - WorkOrchestrator::Init() + StartWorkers() launches worker threads
  - PoolManager::ServerInit() initializes pool management
  - LaunchGpuOrchestrator() launches the persistent GPU kernel (deferred until after pools are composed)
  - StartLocalServer() opens the SHM segment for external clients
- ClientInit() initializes the client-side IPC connection
All four GPU routing modes are available because the process owns the GPU queues and orchestrator.
Separate Client Process
When CHIMAERA_INIT(kClient, false) is called, the process is a pure client. During the ClientConnect handshake, the server transmits GPU queue metadata (offsets, backend sizes, IPC handles) back to the client. If the client binary was compiled with CUDA/ROCm support, ClientInitGpuQueues() attaches to the server's GPU queue backends and reconstructs local queue pointers, enabling direct GPU task submission.
// Server is a separate process
// Client connects via SHM, TCP, or IPC transport
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, false);
chimaera::my_module::Client client(pool_id);
client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id).Wait();
// PoolQuery::Local() always works (routed to server's CPU worker via SHM/TCP/IPC)
auto future = client.AsyncGpuSubmit(chi::PoolQuery::Local(), 0, value);
future.Wait();
// Server's CPU handler executes: result = test_value * 2 + gpu_id
// With GPU support compiled in, LocalGpuBcast also works (direct GPU queue push)
auto gpu_future = client.AsyncGpuSubmit(chi::PoolQuery::LocalGpuBcast(), 0, value);
gpu_future.Wait();
// Server's GPU handler executes: result = test_value * 2 + gpu_id
What gets initialized:
- ClientInit() only — no ServerInit():
  - ConfigManager::Init() reads chimaera.yaml
  - IpcManager::ClientInit() connects to the server via SHM, TCP, or IPC
  - The Admin client singleton is created
- GPU queue attachment (if compiled with CUDA/ROCm):
  - The ClientConnect response carries GPU queue offsets, backend sizes, and IPC handles
  - ClientInitGpuQueues() attaches to each GPU's GpuShmMmap queue backend via shm_attach()
  - cpu2gpu_queues_ and gpu2gpu_queues_ are reconstructed from the received offsets
  - RegisterGpuAllocator() is called so the client can resolve ShmPtrs into queue memory
- No GPU orchestrator:
  - gpu_orchestrator_ is nullptr (the server runs the orchestrator)
Available routing:
| Routing Mode | Works? | Why |
|---|---|---|
| PoolQuery::Local() | Yes | Goes through SendShm() or SendZmq() to the server's CPU worker |
| PoolQuery::LocalGpuBcast() | Yes (with GPU) | SendToGpu() pushes to the attached cpu2gpu_queue → server's GPU orchestrator |
| PoolQuery::ToLocalGpu(id) | Yes (with GPU) | Same path, targeting a specific GPU device |
| PoolQuery::ToLocalCpu() | No | Only callable from GPU device code |
Without GPU support compiled in, LocalGpuBcast and ToLocalGpu return empty futures (no cpu2gpu_queues_ attached). Use PoolQuery::Local() as a fallback — the server's CPU handler executes the task instead.
ClientConnect GPU Queue Handshake
The ClientConnectTask response includes these GPU-specific fields:
| Field | Type | Description |
|---|---|---|
| num_gpus_ | u32 | Number of GPU devices on the server |
| cpu2gpu_queue_off_[8] | u64[8] | Byte offsets of cpu2gpu_queue within each GPU's GpuShmMmap |
| gpu2cpu_queue_off_[8] | u64[8] | Byte offsets of gpu2cpu_queue (GPU→CPU) |
| gpu2gpu_queue_off_[8] | u64[8] | Byte offsets of gpu2gpu_queue |
| gpu_queue_backend_size_[8] | u64[8] | Size of each GPU's GpuShmMmap backend |
| gpu_queue_depth_ | u32 | Queue depth (number of slots) |
The client uses these offsets to reconstruct TaskQueue pointers. The cpu2gpu and gpu2cpu queues reside in the pinned host GpuShmMmap backend; the gpu2gpu queue resides in device memory. A single shm_attach() per GPU attaches the pinned backend; the gpu2gpu backend is attached separately via its IPC handle.
Cross-Process GPU Memory Registration
Clients can register GPU device memory backends with the server using the extended RegisterMemoryTask. This enables the server to resolve ShmPtrs that point into GPU memory allocated by a different process.
// Client allocates a GpuMalloc backend
hipc::MemoryBackendId backend_id(static_cast<u32>(getpid()), 100);
hipc::GpuMalloc gpu_backend;
gpu_backend.shm_init(backend_id, data_size, "", gpu_id);
// Get IPC handle from the backend's private header
hipc::GpuMallocPrivateHeader priv_header;
GpuApi::Memcpy(&priv_header, gpu_backend.GetPrivateHeader(), sizeof(priv_header));
// Register with server via RegisterMemory(kGpuDeviceMemory)
auto reg_task = ipc->NewTask<admin::RegisterMemoryTask>(
CreateTaskId(), kAdminPoolId, PoolQuery::Local(),
backend_id,
admin::MemoryType::kGpuDeviceMemory,
gpu_id, data_size, &priv_header.ipc_handle_);
auto future = ipc->SendZmq(reg_task, IpcMode::kTcp);
future.Wait();
// Server opens the IPC handle and registers the allocator in gpu_alloc_map_
The MemoryType enum controls the registration path:
| MemoryType | Value | Backend | Registration Path |
|---|---|---|---|
| kCpuMemory | 0 | POSIX shared memory | Existing IpcManager::RegisterMemory() |
| kPinnedHostMemory | 1 | GpuShmMmap (pinned) | Attach via shm_attach() + RegisterGpuAllocator() |
| kGpuDeviceMemory | 2 | GpuMalloc (device) | Open IPC handle via shm_attach_ipc() + RegisterGpuAllocator() |
Fork-Based Test Pattern
For testing, a common pattern forks a child server and runs a pure client in the parent:
#include <csignal>
#include <cstdlib>
#include <string>
#include <sys/wait.h>
#include <unistd.h>
int main(int argc, char* argv[]) {
// Child process: server mode
if (argc > 1 && std::string(argv[1]) == "--server-mode") {
CHIMAERA_INIT(ChimaeraMode::kServer, true);
sleep(300); // Wait for parent to kill us
return 0;
}
// Parent: fork server, then run as client
pid_t server = fork();
if (server == 0) {
setpgid(0, 0);
execl("/proc/self/exe", argv[0], "--server-mode", nullptr);
_exit(1);
}
setpgid(server, server);
WaitForServer(); // Poll for SHM segment file
// Pure client
setenv("CHI_WITH_RUNTIME", "0", 1);
CHIMAERA_INIT(ChimaeraMode::kClient, false);
// Submit tasks via Local() routing
chimaera::my_module::Client client(pool_id);
client.AsyncCreate(PoolQuery::Dynamic(), name, pool_id).Wait();
auto future = client.AsyncGpuSubmit(PoolQuery::Local(), 0, 42);
future.Wait();
// result = 42 * 2 + 0 = 84 (CPU handler)
// Cleanup
int status;
kill(-server, SIGKILL);
waitpid(server, &status, 0);
}
The execl("/proc/self/exe", ...) pattern avoids inheriting the parent's CHIMAERA_INIT static guard state.
See context-runtime/modules/MOD_NAME/test/test_gpu_client_process.cc for a complete implementation of this pattern with 5 test cases covering CPU routing, GPU queue attachment, GPU routing, and cross-process memory registration.
Memory Allocation Architecture
GPU Queue Memory (Server Only)
During ServerInitGpuQueues(), the server allocates pinned host memory backends for GPU communication. Each GPU device gets:
| Backend | ID | Type | Size | Purpose |
|---|---|---|---|---|
| CPU→GPU queue backend | 1000 + gpu_id | GpuShmMmap (pinned) | 64 MB | Holds TaskQueue structures (cpu2gpu, gpu2cpu) and FutureShm allocations for CPU→GPU tasks |
| GPU→GPU queue backend | 3000 + gpu_id | Device memory | 32 MB | Holds gpu2gpu_queue and per-thread ArenaAllocator storage for GPU→GPU tasks |
| Orchestrator scratch | 2000 + gpu_id | GpuShmMmap (pinned) | 64 MB | Per-block ArenaAllocator storage for the persistent orchestrator kernel |
| GPU heap | 9000 + gpu_id | GpuMalloc (device) | 64 MB | Per-thread BuddyAllocator (CHI_GPU_HEAP_T) for serialization scratch buffers |
GpuShmMmap backends use cudaHostAlloc (pinned host memory) accessible from both CPU and GPU via UVA. GpuMalloc backends use cudaMalloc (device memory) for GPU-only allocations.
Server Process Memory Layout (per GPU):
+-----------------------------------------------------------------+
| GpuShmMmap (backend ID 1000+gpu_id, 64 MB pinned host memory) |
| URL: /chi_gpu_queue_{gpu_id} |
| |
| +-----------+ +-----------+ |
| | TaskQueue | | TaskQueue | |
| | gpu2cpu_q | | cpu2gpu_q | |
| | (GPU->CPU)| | (CPU->GPU)| |
| +-----------+ +-----------+ |
| |
| +-----------------------------+ |
| | FutureShm allocations | (allocated by SendToGpu) |
| | [FutureShm + copy_space] | |
| +-----------------------------+ |
+-----------------------------------------------------------------+
+-----------------------------------------------------------------+
| Device memory (backend ID 3000+gpu_id, 32 MB) |
| URL: /chi_gpu2gpu_queue_{gpu_id} |
| |
| +-----------+ |
| | TaskQueue | |
| | gpu2gpu_q | |
| | (GPU->GPU)| |
| +-----------+ |
| |
| +--------------------------------------------+ |
| | ArenaAllocator per-thread (GPU->GPU tasks) | |
| +--------------------------------------------+ |
+-----------------------------------------------------------------+
+-----------------------------------------------------------------+
| GpuShmMmap (backend ID 2000+gpu_id, 64 MB pinned host memory) |
| URL: /chi_gpu_orchestrator_{gpu_id} |
| |
| +--------------------------------------------+ |
| | Block 0: ArenaAllocator (per-thread bump) | |
| +--------------------------------------------+ |
| | Block 1: ArenaAllocator | |
| +--------------------------------------------+ |
| | ... | |
| +--------------------------------------------+ |
+-----------------------------------------------------------------+
+-----------------------------------------------------------------+
| GpuMalloc (backend ID 9000+gpu_id, 64 MB device memory) |
| URL: /chi_gpu_heap_{gpu_id} |
| |
| +--------------------------------------------+ |
| | Block 0: BuddyAllocator (CHI_GPU_HEAP_T) | |
| +--------------------------------------------+ |
| | Block 1: BuddyAllocator | |
| +--------------------------------------------+ |
| | ... | |
| +--------------------------------------------+ |
+-----------------------------------------------------------------+
Client Process Memory
A pure client (CHIMAERA_INIT(kClient, false)) does not create GPU backends, but attaches to the server's GPU queue backends during ClientInitGpuQueues() (if compiled with GPU support). Tasks can be sent through:
- SHM mode: FutureShm is allocated in the server's shared memory segment (chi_main_segment). The client writes serialized task data into FutureShm::copy_space via ShmTransport::Send().
- TCP/IPC mode: FutureShm is allocated via HSHM_MALLOC (process-local heap). Task data is serialized and sent through a ZMQ DEALER socket.
- Direct GPU queue push (GPU-compiled clients): SendToGpu() serializes into the attached GpuShmMmap queue backend and pushes to cpu2gpu_queue. The server's GPU orchestrator dequeues and processes the task.
Client Process (GPU-compiled) Memory Layout:
Attached from server (read/write via pinned host memory):
+-----------------------------------------------------------------+
| GpuShmMmap (backend 1000+gpu_id) — attached via shm_attach() |
| cpu2gpu_queue, gpu2gpu_queue pointers reconstructed |
| FutureShm allocations for SendToGpu live here |
+-----------------------------------------------------------------+
Optionally registered with server:
+-----------------------------------------------------------------+
| GpuMalloc (client-owned, backend pid:unique_id) |
| Device memory allocated via cudaMalloc |
| IPC handle sent to server via RegisterMemory(kGpuDeviceMemory) |
| Server opens handle to resolve ShmPtrs from this backend |
+-----------------------------------------------------------------+
GPU Kernel Memory (User-Allocated)
GPU kernels that use the client API require two memory backends passed in IpcManagerGpuInfo:
Primary backend (backend): GpuShmMmap pinned host memory (or GpuMalloc device memory for GPU→GPU). Provides per-thread ArenaAllocator (HSHM_DEFAULT_ALLOC_GPU_T) for fast bump-pointer allocation of FutureShm and task objects.
Heap backend (gpu_heap_backend): GpuMalloc device memory. Provides per-thread BuddyAllocator (CHI_GPU_HEAP_T) for serialization scratch buffers in LocalSaveTaskArchive / LocalLoadTaskArchive. Unlike the arena, the buddy allocator supports individual free(), so scratch memory is reclaimed after each task without exhausting the primary arena.
Use CHI_IPC->GetClientGpuInfo(gpu_id) to build the IpcManagerGpuInfo for same-process kernel launches — it fills all fields automatically:
// Same-process launch (Chimaera running in-process):
chi::IpcManagerGpuInfo gpu_info = CHI_IPC->GetClientGpuInfo(0);
// gpu_info.backend, .gpu2gpu_queue, .gpu_heap_backend all set
// Register primary backend so CPU can resolve GPU ShmPtrs
CHI_IPC->RegisterGpuAllocator(gpu_info.backend.id_,
gpu_info.backend.data_,
gpu_info.backend.data_capacity_);
The CHIMAERA_GPU_INIT macro partitions both backends per-thread:
backend (ArenaAllocator, HSHM_DEFAULT_ALLOC_GPU_T):
+------------------------------------------------------+
| Pointer table (num_threads * sizeof(ptr)) |
+------------------------------------------------------+
| ArenaAllocator 0 (per_thread_size bytes) |
+------------------------------------------------------+
| ArenaAllocator 1 (per_thread_size bytes) |
+------------------------------------------------------+
| ... |
+------------------------------------------------------+
gpu_heap_backend (BuddyAllocator, CHI_GPU_HEAP_T):
+------------------------------------------------------+
| Pointer table (num_threads * sizeof(ptr)) |
+------------------------------------------------------+
| BuddyAllocator 0 (per_thread_size bytes) |
+------------------------------------------------------+
| BuddyAllocator 1 (per_thread_size bytes) |
+------------------------------------------------------+
| ... |
+------------------------------------------------------+
The ArenaAllocator is a bump-pointer allocator; it is reset (via alloc->Reset()) after each task completes — safe because only one task is in-flight per thread at a time. The BuddyAllocator reclaims memory individually as serialization vectors go out of scope. Access the per-thread heap via CHI_GPU_HEAP (expands to CHI_IPC->GetGpuHeap()).
GPU Initialization Sequence
Server GPU Init (Full Sequence)
The server initializes GPU resources in three phases during ServerInit():
Phase 1: Queue creation (IpcManager::ServerInitGpuQueues())
- Query hshm::GpuApi::GetDeviceCount() for the number of GPU devices
- For each GPU, create a 64 MB GpuShmMmap pinned host backend (ID 1000+gpu_id) for the cpu2gpu and gpu2cpu queues plus CPU→GPU FutureShm allocations
- Create a device-memory backend (ID 3000+gpu_id) for the gpu2gpu_queue and its per-thread ArenaAllocator
- Create two TaskQueue objects in the pinned backend:
  - gpu2cpu_queues_[gpu_id] — GPU→CPU queue (CPU worker polls this)
  - cpu2gpu_queues_[gpu_id] — CPU→GPU queue (orchestrator polls this)
- Create one TaskQueue in device memory:
  - gpu2gpu_queues_[gpu_id] — GPU→GPU queue (orchestrator polls this)
- Each queue has 1 lane, 2 priorities (normal + resumed), and configurable depth
- Create orchestrator scratch backends (64 MB GpuShmMmap, IDs 2000+gpu_id)
- Create GPU heap backends (64 MB GpuMalloc, IDs 9000+gpu_id) for CHI_GPU_HEAP_T = BuddyAllocator
- Register all backends with RegisterGpuAllocator() for host-side ShmPtr resolution
- Populate the gpu_orchestrator_info_ struct for GPU 0, including gpu_heap_backend
Phase 2: Pool composition (deferred)
- All compose-section pools are created before the orchestrator launches
- GPU containers are allocated during pool creation via autogenerated allocation kernels
Phase 3: Orchestrator launch (IpcManager::LaunchGpuOrchestrator())
- Read gpu_blocks and gpu_threads_per_block from configuration
- Set the CUDA stack size to 131072 bytes (cudaDeviceSetLimit)
- Allocate WorkOrchestratorControl in pinned host memory
- Allocate gpu::PoolManager on the device
- Create a dedicated CUDA stream
- Launch the persistent kernel: chimaera_gpu_orchestrator<<<blocks, threads_per_block, 0, stream>>>()
- Only block 0, thread 0 runs the gpu::Worker poll loop; all other threads spin-wait for exit
Why deferred? The orchestrator occupies all SMs with persistent thread blocks. If launched before pool composition, cudaMalloc calls during GPU container allocation would deadlock against the persistent kernel.
Client GPU Init (None)
ClientInit() performs no GPU resource creation:
- No ServerInitGpuQueues() — the client never creates GPU queues; with GPU support compiled in, ClientInitGpuQueues() only attaches to the server's queues
- No LaunchGpuOrchestrator() — gpu_orchestrator_ remains nullptr
- No GPU memory backends allocated (queue backends are attached, not owned)
Without GPU support compiled in, the client can only submit tasks through CPU transport (SHM, TCP, IPC).
Queue Assignment Algorithm
Worker Thread Partitioning
The DefaultScheduler::DivideWorkers() method partitions the configured worker threads into four roles:
Given N total worker threads (from num_threads in chimaera.yaml):
Worker 0 -> Scheduler worker (handles task routing)
Workers 1..N-3 -> I/O workers (execute tasks)
Worker N-2 -> GPU worker (polls GPU->CPU lanes)
Worker N-1 -> Network worker (handles ZMQ send/recv)
Example with num_threads: 4:
- Worker 0: scheduler
- Worker 1: I/O worker (the only I/O worker, since 1..N-3 = 1..1)
- Worker 2: GPU worker
- Worker 3: network worker
The GPU worker is assigned only when total_workers > 2. With 2 or fewer workers, there is no dedicated GPU worker and GPU→CPU tasks cannot be processed.
GPU Lane Assignment
After ServerInitGpuQueues() creates the per-GPU TaskQueue objects, AssignGpuLanesToWorker() connects them to the GPU worker:
void IpcManager::AssignGpuLanesToWorker() {
  Worker *gpu_worker = scheduler_->GetGpuWorker();
  std::vector<TaskLane*> gpu_lanes;
  // Collect lane 0 from each GPU's gpu_queue (GPU->CPU)
  for (size_t gpu_id = 0; gpu_id < num_gpus; ++gpu_id) {
    TaskLane *gpu_lane = &gpu_queues_[gpu_id]->GetLane(0, 0);
    gpu_lane->SetAssignedWorkerId(gpu_worker->GetId());
    gpu_lanes.push_back(gpu_lane);
  }
  gpu_worker->SetGpuLanes(gpu_lanes);
}
Result: The single GPU worker polls all GPU→CPU lanes across all GPUs. The GPU work orchestrator (persistent kernel) polls the CPU→GPU and GPU→GPU queues independently.
Queue Routing Summary
Pure Client Process Server Process
+-----------------+ +------------------+
| No GPU queues | | cpu2gpu_queues_[]|----> GPU Orchestrator
| No orchestrator| | gpu2gpu_queues_[]|----> GPU Orchestrator
AsyncGpuSubmit() ->-| SendShm() or |--SHM/TCP/-->| gpu_queues_[] |----> GPU Worker (CPU)
(PoolQuery::Local) | SendZmq() | IPC | |
+-----------------+ +------------------+
|
CPU Workers execute
the task's CPU handler
Client-to-Server Transport Modes
The pure client's transport mode is determined by the CHI_IPC_MODE environment variable or chimaera.yaml configuration:
| Mode | Transport | How FutureShm is Allocated | How Task Data is Sent |
|---|---|---|---|
| SHM | Shared memory | In server's SHM segment | Written to FutureShm::copy_space ring buffer |
| TCP | ZMQ DEALER/ROUTER | Process-local heap (HSHM_MALLOC) | Serialized through ZMQ TCP socket |
| IPC | ZMQ DEALER/ROUTER (Unix socket) | Process-local heap (HSHM_MALLOC) | Serialized through ZMQ IPC socket |
For SHM mode, the server worker deserializes from the FutureShm::copy_space ring buffer using ShmTransport::Recv(). For TCP/IPC mode, the server receives the serialized data through the ZMQ ROUTER socket and deserializes on the network worker thread.
Prerequisites
CMake flags
Enable GPU support when configuring the build:
# CUDA
cmake .. -DWRP_CORE_ENABLE_CUDA=ON
# ROCm
cmake .. -DWRP_CORE_ENABLE_ROCM=ON
These set the HSHM_ENABLE_CUDA or HSHM_ENABLE_ROCM preprocessor macros, which gate all GPU code paths.
Required headers
#include <chimaera/chimaera.h>
#include <chimaera/singletons.h>
#include <chimaera/task.h>
#include <chimaera/pool_query.h>
#include <hermes_shm/util/gpu_api.h>
// Your module's client and task headers
#include <chimaera/my_module/my_module_client.h>
#include <chimaera/my_module/my_module_tasks.h>
Writing a GPU-Compatible Task
A task that can be created and submitted from a GPU kernel must:
- Mark constructors and serialization methods with HSHM_CROSS_FUN
- Provide SerializeIn and SerializeOut methods
- Use only GPU-safe types (no std::string, std::vector, etc.)
struct MyGpuTask : public chi::Task {
IN chi::u32 input_value_;
INOUT chi::u32 result_;
HSHM_CROSS_FUN MyGpuTask() : chi::Task(), input_value_(0), result_(0) {}
HSHM_CROSS_FUN explicit MyGpuTask(
const chi::TaskId& task_id,
const chi::PoolId& pool_id,
const chi::PoolQuery& query,
chi::u32 input_value)
: chi::Task(task_id, pool_id, query, kMethodId),
input_value_(input_value), result_(0) {
task_id_ = task_id;
pool_id_ = pool_id;
method_ = kMethodId;
task_flags_.Clear();
pool_query_ = query;
}
// Serialization: fields sent GPU -> CPU
template <typename Archive>
HSHM_CROSS_FUN void SerializeIn(Archive& ar) {
Task::SerializeIn(ar);
ar(input_value_, result_);
}
// Serialization: fields sent CPU -> GPU (after execution)
template <typename Archive>
HSHM_CROSS_FUN void SerializeOut(Archive& ar) {
Task::SerializeOut(ar);
ar(result_);
}
};
Important: Every function called during serialization must be HSHM_CROSS_FUN (i.e., __host__ __device__). Without this annotation, NVCC's SFINAE-based dispatch silently skips serialization on the GPU, producing corrupted data with no compiler error.
Writing a GPU-Compatible Client
The client class needs HSHM_CROSS_FUN on the constructor and any methods called from GPU code:
class Client : public chi::ContainerClient {
public:
HSHM_CROSS_FUN Client() = default;
HSHM_CROSS_FUN explicit Client(const chi::PoolId& pool_id) { Init(pool_id); }
// Host-only methods (AsyncCreate, etc.) stay as-is...
// GPU-callable task submission
HSHM_CROSS_FUN
chi::Future<MyGpuTask> AsyncMyTask(const chi::PoolQuery& query,
chi::u32 input_value) {
auto* ipc = CHI_IPC;
auto task = ipc->NewTask<MyGpuTask>(
chi::CreateTaskId(), pool_id_, query, input_value);
return ipc->Send(task);
}
};
CHI_IPC->Send() automatically dispatches based on compilation context and routing mode:
- GPU device code (HSHM_IS_GPU): routes via SendGpu or SendGpuLocal
- CPU host code with HSHM_ENABLE_CUDA=1: detects LocalGpuBcast/ToLocalGpu and routes via SendToGpu
- CPU host code without GPU: routes through normal CPU routing (pool query → worker)
GPU Routing Modes
GPU → GPU (Local)
A GPU kernel submits a task that is processed by the GPU work orchestrator. The task runs on the GPU itself.
// Inside a __global__ kernel
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::Local(), input);
future.Wait();
chi::u32 result = future->result_;
The GPU orchestrator must be running (not paused). The task is dispatched to your module's GpuRuntime::MyMethod() handler.
CPU → GPU (LocalGpuBcast)
CPU host code submits a task to be processed by the GPU work orchestrator. The task runs on the GPU.
// Host code -- must be compiled with HSHM_ENABLE_CUDA=1
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
future.Wait(); // blocks until GPU completes + deserializes output
chi::u32 result = future->result_;
How it works:
- Send() detects LocalGpuBcast and calls SendToGpu() instead of normal CPU routing
- SendToGpu() allocates a FutureShm in GPU-accessible pinned host memory, serializes task input into its ring buffer, and pushes to the cpu2gpu_queue
- The GPU orchestrator pops the task, deserializes, dispatches to GpuRuntime, serializes output, and sets FUTURE_COMPLETE
- Wait() spins on FUTURE_COMPLETE, then deserializes output from the FutureShm ring buffer back into the task
Important -- HSHM_ENABLE_CUDA must be 1: The Send() GPU routing interceptor is inside #if HSHM_ENABLE_CUDA || HSHM_ENABLE_ROCM. If your source file is compiled with HSHM_ENABLE_CUDA=0 (e.g., a plain C++ file in a mixed CUDA/C++ target), the interceptor is compiled out and the task will go through normal CPU routing, which does not complete the future correctly for GPU tasks.
Solution: Call AsyncMyTask(LocalGpuBcast(), ...) from a source file compiled as CUDA (.cu or with LANGUAGE CUDA property), or from a wrapper function defined in such a file.
CPU → Specific GPU (ToLocalGpu)
Same as LocalGpuBcast but targets a specific GPU by device ID:
// Send to GPU device 1
auto future = client.AsyncMyTask(chi::PoolQuery::ToLocalGpu(1), input);
future.Wait();
GPU → CPU (ToLocalCpu)
A GPU kernel submits a task to be processed by a CPU worker thread. The task runs on the CPU.
// Inside a __global__ kernel
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::ToLocalCpu(), input);
future.Wait();
chi::u32 result = future->result_;
How it works:
- Send() on the GPU calls SendGpu(), which serializes the task and pushes to the gpu2cpu_queue
- The CPU GPU-worker thread polls this queue, deserializes, and routes to the module's CPU runtime handler
- The CPU worker serializes output into the FutureShm ring buffer and sets FUTURE_COMPLETE
- Wait() on the GPU calls RecvGpu(), which reads from the output ring buffer
Host setup required: The GPU kernel needs a gpu2cpu_queue registered with the runtime. See Host-Side Setup for GPU → CPU below.
GPU Work Orchestrator
The GPU work orchestrator is a persistent CUDA kernel that processes tasks on the GPU. It runs on a dedicated CUDA stream and polls two queues:
- cpu2gpu_queue -- tasks pushed by the CPU via SendToGpu()
- gpu2gpu_queue -- tasks pushed by GPU kernels via SendGpuLocal()
The runtime launches the orchestrator automatically during ServerInit() when GPU support is enabled.
Persistent Kernel Structure
The orchestrator is launched with configurable blocks and threads per block. Only block 0, thread 0 runs the worker poll loop; all other threads spin-wait for the exit signal:
__global__ void chimaera_gpu_orchestrator(gpu::PoolManager *pool_mgr,
gpu::WorkOrchestratorControl *control,
IpcManagerGpuInfo gpu_info,
u32 num_blocks) {
CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks);
if (blockIdx.x == 0 && threadIdx.x == 0) {
control->running_flag = 1;
gpu::Worker worker;
worker.Init(0, gpu_info.cpu2gpu_queue, gpu_info.gpu2gpu_queue,
pool_mgr, gpu_info.cpu2gpu_queue_base);
while (!control->exit_flag) {
worker.PollOnce();
}
worker.Finalize();
}
// Other blocks/threads: wait for exit signal
if (blockIdx.x != 0 || threadIdx.x != 0) {
while (!control->exit_flag) { /* spin */ }
}
}
The extra blocks are launched to occupy all SMs, preventing other kernels from preempting the orchestrator.
GPU Container (GpuRuntime)
Each module that supports GPU execution defines a GpuRuntime class:
// my_module_gpu_runtime.h
class GpuRuntime : public chi::gpu::Container {
public:
HSHM_GPU_FUN GpuRuntime() = default;
HSHM_GPU_FUN ~GpuRuntime() override = default;
/** GPU handler for MyMethod */
HSHM_GPU_FUN void MyMethod(hipc::FullPtr<MyGpuTask> task,
chi::gpu::GpuRunContext &rctx) {
task->result_ = task->input_value_ * 3;
}
// Autogenerated virtual method overrides (switch-case dispatch)
#include "autogen/my_module_gpu_lib_exec.h"
};
Module YAML Configuration
Enable GPU support in your module's chimaera_mod.yaml:
module_name: my_module
namespace: chimaera
# ... method IDs ...
kMyMethod: 25
# GPU support
has_gpu: true
gpu_methods:
- kMyMethod
Run chimaera repo refresh . to regenerate the autogen dispatch code:
- Per-module: autogen/my_module_gpu_lib_exec.h (virtual method overrides)
- Repo-level: src/autogen/gpu_work_orchestrator_modules.h (container allocation + registry)
Pause/Resume
The GPU orchestrator occupies all SMs with persistent thread blocks. To launch other GPU kernels (e.g., test kernels), you must pause the orchestrator first:
CHI_IPC->PauseGpuOrchestrator(); // signal exit + synchronize stream
// ... launch your kernel on a separate stream ...
CHI_IPC->ResumeGpuOrchestrator(); // relaunch persistent kernel
Important: Use stream-based synchronization (hshm::GpuApi::Synchronize(stream)) for your own kernels, not cudaDeviceSynchronize(). The latter would block on the orchestrator's stream.
Container Registration
GPU containers are automatically allocated and registered when a pool with has_gpu: true is created. The allocation kernel runs in the orchestrator's CUDA module context so vtables are correct for virtual dispatch.
GPU Kernel Implementation
Initialization Macros
Three macros are available depending on the kernel's role:
CHIMAERA_GPU_INIT(gpu_info)
For regular client kernels (not the orchestrator). Initializes the per-block IpcManager from a fully-populated IpcManagerGpuInfo. All threads in a block share one IpcManager instance (in __shared__ memory); thread 0 runs ClientInitGpu, then __syncthreads() makes it visible to all threads.
__global__ void my_kernel(chi::IpcManagerGpuInfo gpu_info,
chi::PoolId pool_id,
chi::u32 input,
chi::u32* d_output) {
CHIMAERA_GPU_INIT(gpu_info);
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::Local(), input);
future.Wait();
*d_output = future->result_;
}
CHI_CLIENT_GPU_INIT(gpu_info)
Alias for CHIMAERA_GPU_INIT. Use this in client-process kernels for clarity:
__global__ void client_kernel(chi::IpcManagerGpuInfo gpu_info, ...) {
CHI_CLIENT_GPU_INIT(gpu_info);
// identical behavior
}
CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks)
For kernels that partition their memory across multiple blocks (e.g., benchmark client kernels and the runtime orchestrator itself). Before calling ClientInitGpu, this macro splits all three backends (backend, gpu2cpu_backend, gpu_heap_backend) so each block gets its own non-overlapping slice:
per_block = data_capacity / num_blocks
block N slice starts at: data_ + N * per_block
Use this when launching multiple blocks where each block submits tasks independently:
__global__ void bench_kernel(chi::IpcManagerGpu gpu_info,
chi::PoolId pool_id,
chi::u32 num_blocks,
chi::u32 total_tasks,
int* d_done) {
CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks);
if (threadIdx.x != 0) return; // only thread 0 per block submits
chimaera::my_module::Client client(pool_id);
for (chi::u32 i = 0; i < total_tasks; ++i) {
auto future = client.AsyncMyTask(chi::PoolQuery::Local(), i);
future.Wait();
}
if (blockIdx.x == 0) {
__threadfence_system();
*d_done = 1;
}
}
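The per-block backend split performed by the macro can be sketched in plain C++. The struct and function names here are illustrative stand-ins, not the real IpcManagerGpuInfo API; only the arithmetic (per_block = data_capacity / num_blocks, slice N starts at data_ + N * per_block) comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the per-block backend partitioning done by
// CHIMAERA_GPU_ORCHESTRATOR_INIT. Names are hypothetical.
struct BackendSlice {
  char *data;       // start of this block's slice
  size_t capacity;  // bytes available to this block
};

// Each block N gets a non-overlapping slice of the backend data region.
BackendSlice SliceForBlock(char *data, size_t data_capacity,
                           uint32_t block_id, uint32_t num_blocks) {
  size_t per_block = data_capacity / num_blocks;
  return BackendSlice{data + block_id * per_block, per_block};
}
```

Because the slices never overlap, blocks can allocate futures and tasks concurrently without synchronizing on a shared allocator.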
Host-Side Setup
Host-Side Setup for GPU → GPU
For PoolQuery::Local() from a GPU kernel, the task goes through the gpu2gpu_queue which is managed by the orchestrator. The simplest approach is GetClientGpuInfo(), which fills all IpcManagerGpuInfo fields automatically from the running runtime:
// Build IpcManagerGpuInfo — fills backend, gpu2gpu_queue, gpu_heap_backend, etc.
chi::IpcManagerGpuInfo gpu_info = CHI_IPC->GetClientGpuInfo(0);
// Register primary backend so CPU can resolve GPU ShmPtrs
CHI_IPC->RegisterGpuAllocator(gpu_info.backend.id_,
gpu_info.backend.data_,
gpu_info.backend.data_capacity_);
For custom backends (e.g., benchmark kernels with a dedicated allocation region):
// Custom primary backend (pinned host, for ArenaAllocator)
hipc::MemoryBackendId backend_id(100, 0);
hipc::GpuShmMmap gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/gpu_test", 0);
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
gpu_backend.data_capacity_);
// Custom heap backend (device memory, for BuddyAllocator / CHI_GPU_HEAP_T)
hipc::MemoryBackendId heap_id(101, 0);
hipc::GpuMalloc gpu_heap;
gpu_heap.shm_init(heap_id, 4 * 1024 * 1024, "/gpu_heap", 0);
chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = gpu_backend;
gpu_info.gpu_heap_backend = gpu_heap;
gpu_info.gpu2cpu_queue = nullptr; // not using GPU->CPU
gpu_info.cpu2gpu_queue = nullptr; // not receiving CPU->GPU
gpu_info.gpu2gpu_queue = CHI_IPC->GetGpuToGpuQueue(0);
Use pinned host memory for result polling (not cudaDeviceSynchronize, which hangs with the persistent orchestrator):
int *d_result;
cudaMallocHost(&d_result, sizeof(int));
*d_result = 0;
// Pause orchestrator, launch kernel, resume orchestrator
CHI_IPC->PauseGpuOrchestrator();
my_kernel<<<1, 1, 0, stream>>>(gpu_info, pool_id, input, d_result, d_out);
CHI_IPC->ResumeGpuOrchestrator();
// Poll pinned memory for completion
while (*d_result == 0) {
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
Host-Side Setup for GPU → CPU
For PoolQuery::ToLocalCpu() from a GPU kernel, the task goes through the gpu2cpu_queue that a CPU worker thread polls. The host must create and register this queue, and also provide a pinned-host gpu2cpu_backend so the GPU kernel can allocate FutureShm objects:
// 1. Primary backend (device memory, for ArenaAllocator)
hipc::MemoryBackendId backend_id(5, 0);
hipc::GpuMalloc gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/gpu_to_cpu", 0);
// 2. GPU->CPU copy-space backend (pinned host, for FutureShm allocation)
hipc::MemoryBackendId copy_backend_id(6, 0);
hipc::GpuShmMmap copy_backend;
copy_backend.shm_init(copy_backend_id, 4 * 1024 * 1024, "/gpu_to_cpu_copy", 0);
CHI_IPC->RegisterGpuAllocator(copy_backend_id, copy_backend.data_,
copy_backend.data_capacity_);
// 3. GPU task queue (pinned shared memory)
hipc::MemoryBackendId queue_backend_id(7, 0);
hipc::GpuShmMmap queue_backend;
queue_backend.shm_init(queue_backend_id, 2 * 1024 * 1024, "/gpu_to_cpu_q", 0);
auto *queue_alloc = queue_backend.MakeAlloc<hipc::ArenaAllocator<false>>(
queue_backend.data_capacity_);
auto gpu_queue_ptr = queue_alloc->NewObj<chi::TaskQueue>(
queue_alloc, 1, 2, 1024);
// 4. Register queue and assign to CPU GPU-worker
CHI_IPC->RegisterGpuQueue(gpu_queue_ptr);
CHI_IPC->AssignGpuLanesToWorker();
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
gpu_backend.data_capacity_);
// 5. Build IpcManagerGpuInfo and launch kernel
chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = gpu_backend;
gpu_info.gpu2cpu_queue = gpu_queue_ptr.ptr_;
gpu_info.gpu2cpu_backend = copy_backend;
CPU → GPU (No Extra Host Setup)
For PoolQuery::LocalGpuBcast() from CPU host code, no additional host setup is needed beyond creating the pool. The runtime creates the cpu2gpu_queue and GPU orchestrator during ServerInit().
// Compile this in a CUDA source file (HSHM_ENABLE_CUDA=1)
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, true);
chimaera::my_module::Client client(pool_id);
auto create = client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id);
create.Wait();
// Submit task to GPU -- Send() detects LocalGpuBcast -> SendToGpu()
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
future.Wait(); // spins on FUTURE_COMPLETE, deserializes GPU output
chi::u32 result = future->result_;
Key Registration Functions
| Function | Purpose |
|---|---|
CHI_IPC->GetClientGpuInfo(gpu_id) | Build a fully-populated IpcManagerGpuInfo for same-process kernel launches |
CHI_IPC->RegisterGpuQueue(queue) | Registers a GPU task queue so the runtime can poll it |
CHI_IPC->AssignGpuLanesToWorker() | Assigns all registered GPU queue lanes to the designated GPU worker thread |
CHI_IPC->RegisterGpuAllocator(id, data, capacity) | Registers GPU backend memory so the host can resolve ShmPtrs allocated by the GPU kernel |
CHI_IPC->PauseGpuOrchestrator() | Pauses the persistent GPU orchestrator to free SMs |
CHI_IPC->ResumeGpuOrchestrator() | Resumes the GPU orchestrator |
CHI_IPC->GetGpuToGpuQueue(gpu_id) | Returns the orchestrator's GPU→GPU queue pointer for a given device |
CHI_GPU_HEAP | Per-thread BuddyAllocator (CHI_GPU_HEAP_T) from the GPU heap table; valid in device code after CHIMAERA_GPU_INIT |
Configuration Parameters
GPU Stack Size
GPU kernels that use the client API involve deep template instantiation chains (serialization, ring buffer, allocator). The default CUDA stack size (1024 bytes) is insufficient.
cudaDeviceSetLimit(cudaLimitStackSize, 131072); // 128 KB recommended
Symptoms of stack overflow: silent kernel completion with corrupted or missing data (no error reported by cudaGetLastError).
Ring Buffer Size (Copy Space)
The SPSC ring buffer size determines how much serialized data can be in-flight between GPU and CPU. This is set per-task:
// Inside your task class, override GetCopySpaceSize:
HSHM_CROSS_FUN size_t GetCopySpaceSize() const { return 8192; }
The default is 4096 bytes if GetCopySpaceSize() returns 0. The FutureShm is allocated as sizeof(FutureShm) + copy_space_size, with separate input_ and output_ ring buffers sharing the same copy space.
For tasks with large serialized payloads, increase this value. If the ring buffer is too small, the producer will spin-wait for the consumer to drain data.
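The sizing rule above can be expressed as a small helper. The header-size constant here is a made-up placeholder (the real sizeof(FutureShm) comes from the Chimaera headers); only the default fallback of 4096 bytes and the header-plus-copy-space layout are taken from the text.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for sizeof(FutureShm) -- NOT the real value.
constexpr size_t kFutureShmHeaderSize = 256;
// Fallback used when GetCopySpaceSize() returns 0.
constexpr size_t kDefaultCopySpace = 4096;

// Total allocation for one future: header plus copy space, with a
// zero-size request falling back to the 4096-byte default.
size_t FutureShmAllocSize(size_t copy_space_size) {
  if (copy_space_size == 0) copy_space_size = kDefaultCopySpace;
  return kFutureShmHeaderSize + copy_space_size;
}
```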
GPU Memory Backend Sizes
Two backends must be sized independently:
Primary backend (backend, ArenaAllocator): holds FutureShm + task objects. Reset after each task so size only needs to cover one task at a time:
// Per-block: sizeof(FutureShm) + copy_space + sizeof(TaskT) + overhead
// 10 MB per block is sufficient for typical tasks
size_t backend_size = num_blocks * 10 * 1024 * 1024;
Heap backend (gpu_heap_backend, BuddyAllocator): holds serialization scratch vectors. Individual free() reclaims memory, so size covers peak concurrent scratch usage (~2× copy_space per task):
// Per-block: ~2 * copy_space_size + BuddyAllocator overhead
// 4 MB per block is sufficient for typical tasks
size_t heap_backend_size = num_blocks * 4 * 1024 * 1024;
A single task submission typically consumes from the primary arena:
- sizeof(FutureShm) + copy_space_size for the future
- sizeof(TaskT) for the deserialized task object (orchestrator side)
And from the heap:
- Input serialization vector (~copy_space_size)
- Output serialization vector (~copy_space_size)
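The per-task consumption above can be tallied with a small helper. The future and task sizes passed in are assumptions for illustration; the accounting (arena holds the future plus the task, heap holds two serialization vectors of roughly copy_space_size each) follows the breakdown above.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative per-task memory accounting for one submission.
struct PerTaskUsage {
  size_t arena_bytes;  // primary backend (ArenaAllocator, reset per task)
  size_t heap_bytes;   // buddy heap (freed individually)
};

PerTaskUsage EstimateUsage(size_t copy_space_size, size_t future_shm_size,
                           size_t task_size) {
  PerTaskUsage u;
  // Arena: the future (header + copy space) plus the deserialized task.
  u.arena_bytes = future_shm_size + copy_space_size + task_size;
  // Heap: input + output serialization vectors, each ~copy_space_size.
  u.heap_bytes = 2 * copy_space_size;
  return u;
}
```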
GPU Task Queue Configuration
// Parameters: allocator, num_lanes, num_priorities, depth
queue_alloc->NewObj<chi::TaskQueue>(queue_alloc, 1, 2, 1024);
| Parameter | Default | Description |
|---|---|---|
num_lanes | 1 | Number of queue lanes (1 per GPU is typical) |
num_priorities | 2 | Priority levels per lane |
depth | 1024 | Maximum tasks per lane before blocking |
Queue Backend Size
The queue backend holds the TaskQueue data structure and its internal lane storage:
size_t queue_memory_size = 2 * 1024 * 1024; // 2 MB is sufficient for most use cases
Runtime Thread Configuration
The Chimaera runtime automatically detects GPU devices and creates GPU queues during ServerInit. The default scheduler (DefaultScheduler) partitions workers as follows:
# chimaera.yaml
num_threads: 4
With num_threads: N:
| Worker ID | Role | Responsibility |
|---|---|---|
| 0 | Scheduler | Task routing and scheduling |
| 1 to N-3 | I/O workers | Execute tasks from CPU queues |
| N-2 | GPU worker | Polls GPU→CPU queue lanes |
| N-1 | Network worker | Handles ZMQ send/recv |
The GPU worker polls all GPU→CPU lanes across all GPU devices. With num_threads: 4, this yields 1 scheduler, 1 I/O worker, 1 GPU worker, and 1 network worker.
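The partitioning rule in the table can be sketched as a lookup. This mirrors the table rows directly (scheduler at 0, network at N-1, GPU at N-2, I/O in between); the real DefaultScheduler logic may handle edge cases differently.

```cpp
#include <cassert>
#include <string>

// Map a worker ID to its role under DefaultScheduler with N threads.
// Assumes N >= 4 so every role gets at least one worker.
std::string WorkerRole(unsigned id, unsigned num_threads) {
  if (id == 0) return "scheduler";
  if (id == num_threads - 1) return "network";
  if (id == num_threads - 2) return "gpu";
  return "io";  // IDs 1 .. N-3
}
```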
CMake Integration
Mixed CUDA/C++ Targets
When building a test or application that has both CUDA and C++ source files, keep GPU-routing calls in the CUDA source file. The typical CMake pattern:
# GPU kernels compiled as CUDA
set_source_files_properties(my_gpu_kernels.cc PROPERTIES LANGUAGE CUDA)
add_library(my_gpu_kernels OBJECT my_gpu_kernels.cc)
target_link_libraries(my_gpu_kernels PRIVATE hshm::cuda_cxx)
# Main executable with CPU sources
add_executable(my_app my_main.cc $<TARGET_OBJECTS:my_gpu_kernels>)
# CRITICAL: CPU sources are compiled with HSHM_ENABLE_CUDA=0 to avoid
# __device__ errors. GPU routing in Send() is compiled out.
set_source_files_properties(my_main.cc PROPERTIES
COMPILE_OPTIONS "-UHSHM_ENABLE_CUDA;-DHSHM_ENABLE_CUDA=0"
)
Consequence: client.AsyncMyTask(LocalGpuBcast(), ...) called from my_main.cc will NOT route to the GPU because the Send() interceptor is compiled out. Instead, define a wrapper in the CUDA source file:
// my_gpu_kernels.cc (compiled as CUDA, HSHM_ENABLE_CUDA=1)
extern "C" int run_cpu_to_gpu(chi::PoolId pool_id, chi::u32 input,
chi::u32 *out_result) {
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
if (!future.Wait(10.0f)) return -3; // timeout
*out_result = future->result_;
return 1;
}
GPU Module Include Directories
GPU targets need $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}> for autogen headers:
target_include_directories(my_gpu_target PRIVATE
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
)
Data Flow Details
Send Path (GPU → CPU)
- NewTask: Allocates the task from the per-thread ArenaAllocator
- SendGpu: Allocates a FutureShm with embedded ring-buffer copy space
- Enqueue: Pushes the Future onto the GPU task queue lane (the CPU worker polls this)
- Serialize + Send: Builds a LocalSaveTaskArchive, calls task->SerializeIn(ar), then streams the archive through ShmTransport::Send into the input ring buffer
Receive Path (CPU → GPU)
- CPU worker: Dequeues the FutureShm, reads from the input ring buffer via ShmTransport::Recv, deserializes, executes the task, then writes output back through the output ring buffer
- RecvGpu: On the GPU, Future::Wait() calls RecvGpu, which reads from the output ring buffer via ShmTransport::Recv
- Deserialize: A LocalLoadTaskArchive deserializes the output fields via task->SerializeOut(ar)
- Completion: The GPU spins on the FUTURE_COMPLETE flag (already set by the time Recv completes)
CPU → GPU Send Path (SendToGpu)
- Send() detects LocalGpuBcast/ToLocalGpu routing and calls SendToGpu()
- SendToGpu allocates a FutureShm in GPU-accessible pinned host memory via AllocateGpuBuffer()
- Serializes task input via task->SerializeIn(ar) + ShmTransport::Send into the input ring buffer
- Pushes a Future<Task> onto cpu2gpu_queues_[gpu_id]
- Returns a Future<TaskT> with the pinned-memory FutureShm
CPU → GPU Receive Path (Wait runtime path)
- Wait() enters the runtime spin-wait loop on FUTURE_COMPLETE
- The GPU orchestrator pops the task, dispatches to GpuRuntime, serializes output into the FutureShm's output ring buffer, calls __threadfence_system(), and sets FUTURE_COMPLETE
- Wait() detects the FUTURE_COPY_FROM_CLIENT flag and a non-zero output_.copy_space_size_, then deserializes output via ShmTransport::Recv + task->SerializeOut(ar)
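The spin-wait on FUTURE_COMPLETE can be modeled in plain C++ with a release/acquire pair. This is a host-only sketch: std::atomic stands in for the real flag word, and the release store models the role that __threadfence_system() plays on the GPU side before the flag is set.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

constexpr uint32_t FUTURE_COMPLETE = 1u << 0;

// Sketch of Future::Wait(): spin until the producer publishes
// FUTURE_COMPLETE with release semantics; after the acquire load
// succeeds, it is safe to read the serialized output.
void WaitForComplete(std::atomic<uint32_t> &flags) {
  while ((flags.load(std::memory_order_acquire) & FUTURE_COMPLETE) == 0) {
    std::this_thread::yield();
  }
}
```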
Client Process → Server CPU Path (SendShm/SendZmq)
When a pure client uses PoolQuery::Local(), the task bypasses all GPU queues and goes through the standard client transport:
- SendShm: Allocates a FutureShm in the server's SHM segment with a copy_space ring buffer, enqueues to a worker_queues_ lane (mapped by ClientMapTask), and serializes the task via SaveTaskArchive + ShmTransport::Send into the ring buffer
- Server worker: Dequeues the FutureShm, checks the FUTURE_COPY_FROM_CLIENT flag, reads from the copy_space ring buffer via ShmTransport::Recv, deserializes, and executes the CPU handler (e.g., result = test_value * 2 + gpu_id)
- Response: The worker serializes output back through the ring buffer and sets FUTURE_COMPLETE
- Client Wait(): Spins on FUTURE_COMPLETE, then deserializes output from copy_space
This path uses the CPU runtime handler, not the GPU handler. The computation differs:
- CPU handler: result_value = test_value * 2 + gpu_id
- GPU handler: result_value = test_value * 3 + gpu_id
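The two formulas above are handy when debugging which handler actually ran; a result can be checked against both. Function names are ad hoc; the arithmetic is taken directly from the text.

```cpp
#include <cassert>
#include <cstdint>

// Handler formulas from the test module, used to tell apart
// which side executed a GpuSubmitTask.
uint32_t CpuHandlerResult(uint32_t test_value, uint32_t gpu_id) {
  return test_value * 2 + gpu_id;
}
uint32_t GpuHandlerResult(uint32_t test_value, uint32_t gpu_id) {
  return test_value * 3 + gpu_id;
}
```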
Wire Format
GPU and CPU use compatible but different archive types:
| Side | Serialization | Deserialization |
|---|---|---|
| GPU (send) | LocalSaveTaskArchive | LocalLoadTaskArchive |
| CPU (worker) | SaveTaskArchive | LoadTaskArchive |
Both produce the same binary wire format: [vector_size][elements...][msg_type][stream_data]. This is critical -- if the formats diverge, deserialization will produce corrupted data.
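A minimal sketch of the wire layout described above. The field widths (u64 for the vector size, u32 for the message type) are assumptions for illustration; the real archives define the exact types, and this shows only why both sides must agree byte-for-byte.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode [vector_size][elements...][msg_type][stream_data] into a
// flat buffer. Field widths are illustrative assumptions.
std::vector<uint8_t> EncodeWire(const std::vector<uint8_t> &elements,
                                uint32_t msg_type,
                                const std::vector<uint8_t> &stream) {
  std::vector<uint8_t> out;
  uint64_t n = elements.size();
  const uint8_t *np = reinterpret_cast<const uint8_t *>(&n);
  out.insert(out.end(), np, np + sizeof(n));            // [vector_size]
  out.insert(out.end(), elements.begin(), elements.end());  // [elements...]
  const uint8_t *mp = reinterpret_cast<const uint8_t *>(&msg_type);
  out.insert(out.end(), mp, mp + sizeof(msg_type));     // [msg_type]
  out.insert(out.end(), stream.begin(), stream.end());  // [stream_data]
  return out;
}
```

If the producer and consumer disagree on any field width, every subsequent field is read at the wrong offset, which is exactly the "corrupted data" failure mode the text warns about.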
Client Process GPU Tests
The test file context-runtime/modules/MOD_NAME/test/test_gpu_client_process.cc verifies cross-process GPU task submission using the fork-based test pattern. A child process runs as a standalone Chimaera server (with --server-mode), while the parent connects as a pure client.
Test Infrastructure
All tests share the same server/client lifecycle managed by EnsureInitialized():
- StartServerProcess(): Forks a child, sets CHI_WITH_RUNTIME=1, and execs back into the same binary with --server-mode. The child calls CHIMAERA_INIT(kServer, true) and sleeps until killed.
- WaitForServer(): Polls for the SHM segment file (/tmp/chimaera_<user>/chi_main_segment_<user>) up to 50 times at 200 ms intervals.
- Parent client init: Sets CHI_WITH_RUNTIME=0 and calls CHIMAERA_INIT(kClient, false). During ClientConnect, the client receives GPU queue metadata from the server.
- Cleanup: CleanupServer() sends SIGKILL to the server process group and removes SHM files.
Test Cases
client_process_gpu_submit_local
Submits a GpuSubmitTask with PoolQuery::Local() routing from the client process. The task travels via SHM or TCP transport to the server's CPU worker, which executes the CPU handler (result = test_value * 2 + gpu_id). Verifies the result matches the expected CPU computation.
This test works regardless of GPU support — Local() routing always goes through the CPU worker.
client_process_gpu_submit_multiple
Submits 5 sequential GpuSubmitTasks with different test_value parameters (100–104), each via PoolQuery::Local(). Verifies that all return correct results, confirming the transport handles repeated task round-trips without state corruption.
client_process_gpu_queue_attachment
Verifies that ClientInitGpuQueues() successfully attached to the server's GPU queues during ClientConnect. Checks that GetToGpuQueueCount() > 0 and that GetToGpuQueue(0) and GetGpuToGpuQueue(0) return non-null pointers.
Requires HSHM_ENABLE_CUDA or HSHM_ENABLE_ROCM at compile time. Without GPU support, prints "GPU support not compiled, skipping" and passes.
client_process_gpu_submit_to_gpu
Submits a GpuSubmitTask with PoolQuery::LocalGpuBcast() routing. This exercises the SendToGpu() path: the client serializes the task into the attached GpuShmMmap queue backend and pushes it onto cpu2gpu_queue. The server's GPU orchestrator dequeues and dispatches to the module's GPU handler.
Requires GPU support compiled in and a GPU device on the server. Verifies that result = test_value * 2 + gpu_id (GPU handler uses the same formula in this test module).
client_process_register_gpu_memory
Tests cross-process GPU memory registration. The client:
- Allocates a 4 MB GpuMalloc backend (device memory via cudaMalloc)
- Reads the IPC handle from the backend's private header (GPU→host memcpy)
- Sends a RegisterMemoryTask with MemoryType::kGpuDeviceMemory to the server
- The server opens the IPC handle via GpuMalloc::shm_attach_ipc() and registers the allocator in gpu_alloc_map_
After registration, the server can resolve ShmPtrs pointing into this client-owned device memory.
Building and Running
cd /workspace/build
cmake /workspace && cmake --build . -j$(nproc)
# CRITICAL: install ALL .so files (stale modules cause serialization mismatches)
sudo cp bin/lib*.so /usr/local/lib/ && sudo ldconfig
# Run all client process GPU tests
./bin/chimaera_gpu_client_process_tests
# Run a specific test
./bin/chimaera_gpu_client_process_tests client_process_gpu_submit_local
Note: Tests 3–5 (gpu_queue_attachment, gpu_submit_to_gpu, register_gpu_memory) require the test binary to be compiled with CUDA/ROCm support. Without it, they skip gracefully. Tests 1–2 work with any build configuration.
Performance
GPU→GPU Task Round-Trip Latency
The following latency was measured using the in-process benchmark (bench_gpu_runtime) with 1 client block × 1 active thread (sequential AsyncGpuSubmit + Wait() loop):
Allocator (CHI_GPU_HEAP_T) | Avg Latency | Throughput (1 thread) | Notes |
|---|---|---|---|
BuddyAllocator (default) | ~200 µs | ~5,000 tasks/sec | Managed pool; per-allocation free; no exhaustion |
MallocAllocator (device heap) | ~400 µs | ~2,500 tasks/sec | CUDA device malloc()/free(); serialized across warps |
BuddyAllocator backed by GpuMalloc device memory is the default because it is 2× faster than CUDA device heap malloc() and does not exhaust memory across arbitrarily many tasks.
How Latency Is Measured
The benchmark timer starts on the CPU immediately after ResumeGpuOrchestrator() and ends when the CPU polls *d_done == 1 (set by the GPU after the last task completes). The formula is:
avg_latency = elapsed_ms * 1000 / total_tasks (µs per task)
This captures the full GPU→GPU round-trip per task:
- AsyncGpuSubmit: serialize inputs into the FutureShm ring buffer, push to gpu2gpu_queue
- Orchestrator polls and dispatches; DispatchTask: deserialize → Run() → serialize output
- FUTURE_COMPLETE written with __threadfence_system() + a system-scope atomic
- future.Wait() spin exits, RecvGpu deserializes the output
Note: the first task's queue-push overhead is excluded because the GPU kernel starts before the CPU timer. CPU polling overhead (100 µs sleep per iteration) adds at most ~0.1 µs per task across 1,000 tasks and is negligible.
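The arithmetic behind the reported numbers can be checked directly. The inputs below (200 ms elapsed, 1,000 tasks) reproduce the ~200 µs BuddyAllocator figure from the table above.

```cpp
#include <cassert>

// avg_latency (microseconds per task) = elapsed_ms * 1000 / total_tasks
double AvgLatencyUs(double elapsed_ms, unsigned total_tasks) {
  return elapsed_ms * 1000.0 / total_tasks;
}
```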
Memory Limits
The default GpuShmMmap orchestrator scratch (HSHM_DEFAULT_ALLOC_GPU_T = ArenaAllocator) is reset after each completed task (alloc->Reset()), so it never exhausts regardless of task count. The BuddyAllocator heap (CHI_GPU_HEAP_T) reclaims memory individually as serialization vectors go out of scope.
Troubleshooting
CUDA Error 700 (Illegal Memory Access)
Usually caused by GPU stack overflow during deep serialization template chains. Increase the stack size:
cudaDeviceSetLimit(cudaLimitStackSize, 131072);
CUDA Error 715 (Illegal Instruction)
Caused by using system-scope atomics on pinned host memory, or by cross-library virtual function calls. Solutions:
- Use device-scope atomics (atomicAdd, atomicOr) with explicit __threadfence_system() at strategic points
- Ensure all GPU containers are allocated within the orchestrator's CUDA module context (the autogen system handles this)
Silent data corruption (kernel completes but wrong values)
Check that all serialization functions are marked HSHM_CROSS_FUN. Without __device__ annotation, NVCC's SFINAE dispatch silently skips serialization, consuming 0 bytes and misaligning the stream. The compiler produces no error.
Ring buffer deadlock (kernel hangs)
The GPU producer and CPU consumer share the ring buffer. If the buffer fills, the producer spins waiting for the consumer. Ensure:
- The CPU worker is running and polling the GPU queue lanes
- RegisterGpuQueue and AssignGpuLanesToWorker were called before the kernel launch
- The copy space is large enough for the serialized payload
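The producer-side behavior can be modeled with a minimal SPSC byte ring (a sketch, not the Chimaera implementation): when fewer bytes are free than the payload needs, the push fails, which is the point where the real GPU producer spins until the consumer drains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal single-producer/single-consumer byte ring. head and tail
// grow monotonically; free space = capacity - (head - tail).
struct SpscRing {
  std::vector<uint8_t> buf;
  size_t head = 0, tail = 0;  // head: next write, tail: next read
  explicit SpscRing(size_t cap) : buf(cap) {}
  size_t FreeBytes() const { return buf.size() - (head - tail); }
  // Returns false when the payload does not fit -- the real GPU
  // producer would spin here until the CPU consumer drains.
  bool TryPush(const uint8_t *data, size_t n) {
    if (FreeBytes() < n) return false;
    for (size_t i = 0; i < n; ++i) buf[(head + i) % buf.size()] = data[i];
    head += n;
    return true;
  }
  void Pop(size_t n) { tail += n; }  // consumer drains n bytes
};
```

This makes the deadlock condition concrete: if no consumer ever calls Pop (CPU worker not polling), a payload larger than the remaining free space blocks the producer forever.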
CPU→GPU Wait() timeout
If future.Wait() times out after SendToGpu():
- Verify the GPU orchestrator is running (not paused)
- Verify the module's GPU container is registered (check for "Registered GPU container" log message)
- Verify the call site is compiled with HSHM_ENABLE_CUDA=1 -- if compiled with HSHM_ENABLE_CUDA=0, the Send() GPU interceptor is absent and the task goes through normal CPU routing, which cannot complete a GPU-bound future
Client process GPU routing fails
If a pure client process (CHIMAERA_INIT(kClient, false)) tries to use LocalGpuBcast or ToLocalGpu and gets empty futures:
- Binary not compiled with GPU support: ClientInitGpuQueues() is skipped and cpu2gpu_queues_ is empty. Use PoolQuery::Local() as a fallback (the task goes to the server's CPU handler).
- Stale installed libraries: After rebuilding with GPU queue extensions, all .so files must be reinstalled, not just libchimaera_cxx.so. Stale module .so files (e.g., libchimaera_admin_runtime.so) cause ClientConnect serialization mismatches because the server loads modules from /usr/local/lib/. Fix: sudo cp /workspace/build/bin/lib*.so /usr/local/lib/ && sudo ldconfig
- Server has no GPUs: ClientConnect returns num_gpus_ == 0, so no queues are attached. This is expected behavior.
Kernel launch failure (-201)
Check cudaGetLastError() after the kernel launch. Common causes:
- Too many threads per block (the ArenaAllocator table won't fit in memory)
- Missing CHIMAERA_GPU_INIT (uninitialized shared memory access)
ShmPtr resolution failure on host
The CPU worker must be able to resolve ShmPtrs that point into GPU-allocated pinned memory. Call RegisterGpuAllocator before launching the kernel:
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
gpu_backend.data_capacity_);
GPU orchestrator SM starvation
The persistent orchestrator occupies all SMs (32 blocks by default). To launch other GPU kernels:
- Call CHI_IPC->PauseGpuOrchestrator()
- Launch your kernel on a separate stream
- Synchronize your stream (not cudaDeviceSynchronize)
- Call CHI_IPC->ResumeGpuOrchestrator()