
GPU Client Kernels

This guide explains how CUDA or ROCm kernels can submit tasks to the Chimaera runtime and receive results, using the same client API as host code. It also covers cross-device routing: CPU code sending tasks to run on the GPU, and GPU kernels sending tasks to run on the CPU.

Overview

The Chimaera runtime supports four GPU task routing modes:

| Routing Mode | Direction | Description |
| --- | --- | --- |
| PoolQuery::Local() | GPU → GPU | GPU kernel submits a task processed by the GPU work orchestrator |
| PoolQuery::LocalGpuBcast() | CPU → GPU | CPU submits a task processed by the GPU work orchestrator |
| PoolQuery::ToLocalGpu(gpu_id) | CPU → GPU | CPU submits a task to a specific GPU device |
| PoolQuery::ToLocalCpu() | GPU → CPU | GPU kernel submits a task processed by a CPU worker |

All paths use the same client.AsyncMyTask(query, ...) API. The IpcManager::Send() method detects the routing mode and dispatches to the correct transport automatically.
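The dispatch logic can be pictured as a switch on the routing mode. The sketch below is illustrative only — `RoutingMode` and `PickTransport` are hypothetical stand-ins, not Chimaera types — but it mirrors the routing table above: one entry point, transport chosen from the query and from whether the caller is device code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the PoolQuery routing modes; names are
// illustrative, not the actual Chimaera API.
enum class RoutingMode { kLocal, kLocalGpuBcast, kToLocalGpu, kToLocalCpu };

// Mirrors the dispatch described above: a single Send() entry point picks
// the transport from the routing mode and the compilation context.
inline std::string PickTransport(RoutingMode mode, bool on_gpu) {
  if (on_gpu) {
    // Device code: Local() goes to the GPU orchestrator's gpu2gpu queue,
    // ToLocalCpu() goes to the gpu2cpu queue polled by a CPU worker.
    return mode == RoutingMode::kToLocalCpu ? "SendGpu" : "SendGpuLocal";
  }
  switch (mode) {
    case RoutingMode::kLocalGpuBcast:
    case RoutingMode::kToLocalGpu:
      return "SendToGpu";  // push onto cpu2gpu_queue
    default:
      return "SendCpu";    // normal CPU routing (SHM/TCP/IPC)
  }
}
```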

Architecture

```text
CPU Host                                              GPU Work Orchestrator
  |                                                     |
  | AsyncGpuSubmit(LocalGpuBcast())                     |
  |  -> Send() detects LocalGpuBcast                    |
  |  -> SendToGpu(): serialize + push cpu2gpu_queue --->| pop + deserialize
  |                                                     | dispatch to GpuRuntime
  | Wait(): spin on FUTURE_COMPLETE                     | serialize output
  | <- deserialize output from FutureShm <--------------| threadfence_system + FUTURE_COMPLETE
  |                                                     |

GPU Kernel                                            CPU Worker
  |                                                     |
  | AsyncGpuSubmit(ToLocalCpu())                        |
  |  -> SendGpu(): serialize + push gpu2cpu_queue ----->| dequeue + deserialize
  |                                                     | execute task
  | Wait(): RecvGpu reads output ring buffer <----------| serialize output + FUTURE_COMPLETE
  |                                                     |

GPU Kernel                                            GPU Work Orchestrator
  |                                                     |
  | AsyncGpuSubmit(Local())                             |
  |  -> SendGpuLocal(): push gpu2gpu_queue ------------>| pop + deserialize
  |                                                     | dispatch to GpuRuntime
  | Wait(): RecvGpu reads output ring buffer <----------| serialize output + FUTURE_COMPLETE
```

Runtime vs. Client Process Execution Modes

The Chimaera runtime supports two process configurations for GPU task submission:

| Mode | Init Call | GPU Resources | Use Case |
| --- | --- | --- | --- |
| In-process runtime | CHIMAERA_INIT(kClient, true) | Full: all GPU queues, orchestrator, GPU memory backends | Single process with embedded server |
| Separate client process | CHIMAERA_INIT(kClient, false) | None: no GPU queues or orchestrator | Client connects to standalone server |

In-Process Runtime (Embedded Server)

When CHIMAERA_INIT(kClient, true) is called, the process runs both client and server:

```cpp
// One process does everything
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, true);

chimaera::my_module::Client client(pool_id);
client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id).Wait();

// All GPU routing modes work:
// Local(), LocalGpuBcast(), ToLocalGpu(), ToLocalCpu()
auto future = client.AsyncGpuSubmit(chi::PoolQuery::LocalGpuBcast(), 0, value);
future.Wait();
```

What gets initialized:

  1. ServerInit() runs in sequence:
    • IpcManager::ServerInit() → ServerInitGpuQueues() creates all GPU queue infrastructure
    • WorkOrchestrator::Init() + StartWorkers() launches worker threads
    • PoolManager::ServerInit() initializes pool management
    • LaunchGpuOrchestrator() launches persistent GPU kernel (deferred until after pools are composed)
    • StartLocalServer() opens the SHM segment for external clients
  2. ClientInit() initializes the client-side IPC connection

All four GPU routing modes are available because the process owns the GPU queues and orchestrator.

Separate Client Process

When CHIMAERA_INIT(kClient, false) is called, the process is a pure client. During the ClientConnect handshake, the server transmits GPU queue metadata (offsets, backend sizes, IPC handles) back to the client. If the client binary was compiled with CUDA/ROCm support, ClientInitGpuQueues() attaches to the server's GPU queue backends and reconstructs local queue pointers, enabling direct GPU task submission.

```cpp
// Server is a separate process
// Client connects via SHM, TCP, or IPC transport
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, false);

chimaera::my_module::Client client(pool_id);
client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id).Wait();

// PoolQuery::Local() always works (routed to server's CPU worker via SHM/TCP/IPC)
auto future = client.AsyncGpuSubmit(chi::PoolQuery::Local(), 0, value);
future.Wait();
// Server's CPU handler executes: result = test_value * 2 + gpu_id

// With GPU support compiled in, LocalGpuBcast also works (direct GPU queue push)
auto gpu_future = client.AsyncGpuSubmit(chi::PoolQuery::LocalGpuBcast(), 0, value);
gpu_future.Wait();
// Server's GPU handler executes: result = test_value * 2 + gpu_id
```

What gets initialized:

  1. ClientInit() only — no ServerInit():
    • ConfigManager::Init() reads chimaera.yaml
    • IpcManager::ClientInit() connects to the server via SHM, TCP, or IPC
    • Admin client singleton is created
  2. GPU queue attachment (if compiled with CUDA/ROCm):
    • ClientConnect response carries GPU queue offsets, backend sizes, and IPC handles
    • ClientInitGpuQueues() attaches to each GPU's GpuShmMmap queue backend via shm_attach()
    • cpu2gpu_queues_, gpu2gpu_queues_ are reconstructed from the received offsets
    • RegisterGpuAllocator() is called so the client can resolve ShmPtrs into queue memory
  3. No GPU orchestrator: gpu_orchestrator_ is nullptr (the server runs the orchestrator)

Available routing:

| Routing Mode | Works? | Why |
| --- | --- | --- |
| PoolQuery::Local() | Yes | Goes through SendShm() or SendZmq() to server's CPU worker |
| PoolQuery::LocalGpuBcast() | Yes (with GPU) | SendToGpu() pushes to attached cpu2gpu_queue → server's GPU orchestrator |
| PoolQuery::ToLocalGpu(id) | Yes (with GPU) | Same path, targeting a specific GPU device |
| PoolQuery::ToLocalCpu() | No | Only callable from GPU device code |

Without GPU support compiled in, LocalGpuBcast and ToLocalGpu return empty futures (no cpu2gpu_queues_ attached). Use PoolQuery::Local() as a fallback — the server's CPU handler executes the task instead.

ClientConnect GPU Queue Handshake

The ClientConnectTask response includes these GPU-specific fields:

| Field | Type | Description |
| --- | --- | --- |
| num_gpus_ | u32 | Number of GPU devices on the server |
| cpu2gpu_queue_off_[8] | u64[8] | Byte offsets of cpu2gpu_queue within each GPU's GpuShmMmap |
| gpu2cpu_queue_off_[8] | u64[8] | Byte offsets of gpu2cpu_queue (GPU→CPU) |
| gpu2gpu_queue_off_[8] | u64[8] | Byte offsets of gpu2gpu_queue |
| gpu_queue_backend_size_[8] | u64[8] | Size of each GPU's GpuShmMmap backend |
| gpu_queue_depth_ | u32 | Queue depth (number of slots) |

The client uses these offsets to reconstruct TaskQueue pointers. The cpu2gpu and gpu2cpu queues reside in the pinned host GpuShmMmap backend; the gpu2gpu queue resides in device memory. A single shm_attach() per GPU attaches the pinned backend; the gpu2gpu backend is attached separately via its IPC handle.
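The pointer reconstruction is ordinary base-plus-offset arithmetic: the server's offsets are valid in any process once that process has attached the same backend at its own local address. A minimal sketch, with `QueueHeader` and `AttachQueue` as hypothetical placeholders (not the real TaskQueue layout):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Placeholder for a queue header living inside an attached backend; the
// real TaskQueue layout is richer, this only illustrates the offset math.
struct QueueHeader {
  std::uint32_t depth;
  std::uint32_t lanes;
};

// Rebuild a local queue pointer from the client's attach address plus the
// byte offset received in the ClientConnect response.
inline QueueHeader* AttachQueue(void* backend_base, std::uint64_t queue_off,
                                std::uint64_t backend_size) {
  // The offset must land fully inside the attached backend.
  if (queue_off + sizeof(QueueHeader) > backend_size) return nullptr;
  return reinterpret_cast<QueueHeader*>(
      static_cast<char*>(backend_base) + queue_off);
}
```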

Cross-Process GPU Memory Registration

Clients can register GPU device memory backends with the server using the extended RegisterMemoryTask. This enables the server to resolve ShmPtrs that point into GPU memory allocated by a different process.

```cpp
// Client allocates a GpuMalloc backend
hipc::MemoryBackendId backend_id(static_cast<u32>(getpid()), 100);
hipc::GpuMalloc gpu_backend;
gpu_backend.shm_init(backend_id, data_size, "", gpu_id);

// Get IPC handle from the backend's private header
hipc::GpuMallocPrivateHeader priv_header;
GpuApi::Memcpy(&priv_header, gpu_backend.GetPrivateHeader(), sizeof(priv_header));

// Register with server via RegisterMemory(kGpuDeviceMemory)
auto reg_task = ipc->NewTask<admin::RegisterMemoryTask>(
    CreateTaskId(), kAdminPoolId, PoolQuery::Local(),
    backend_id,
    admin::MemoryType::kGpuDeviceMemory,
    gpu_id, data_size, &priv_header.ipc_handle_);
auto future = ipc->SendZmq(reg_task, IpcMode::kTcp);
future.Wait();
// Server opens the IPC handle and registers the allocator in gpu_alloc_map_
```

The MemoryType enum controls the registration path:

| MemoryType | Value | Backend | Registration Path |
| --- | --- | --- | --- |
| kCpuMemory | 0 | POSIX shared memory | Existing IpcManager::RegisterMemory() |
| kPinnedHostMemory | 1 | GpuShmMmap (pinned) | Attach via shm_attach() + RegisterGpuAllocator() |
| kGpuDeviceMemory | 2 | GpuMalloc (device) | Open IPC handle via shm_attach_ipc() + RegisterGpuAllocator() |

Fork-Based Test Pattern

For testing, a common pattern forks a child server and runs a pure client in the parent:

```cpp
int main(int argc, char* argv[]) {
  // Child process: server mode
  if (argc > 1 && std::string(argv[1]) == "--server-mode") {
    CHIMAERA_INIT(ChimaeraMode::kServer, true);
    sleep(300);  // Wait for parent to kill us
    return 0;
  }

  // Parent: fork server, then run as client
  pid_t server = fork();
  if (server == 0) {
    setpgid(0, 0);
    execl("/proc/self/exe", argv[0], "--server-mode", (char*)nullptr);
    _exit(1);
  }
  setpgid(server, server);
  WaitForServer();  // Poll for SHM segment file

  // Pure client
  setenv("CHI_WITH_RUNTIME", "0", 1);
  CHIMAERA_INIT(ChimaeraMode::kClient, false);

  // Submit tasks via Local() routing
  chimaera::my_module::Client client(pool_id);
  client.AsyncCreate(PoolQuery::Dynamic(), name, pool_id).Wait();
  auto future = client.AsyncGpuSubmit(PoolQuery::Local(), 0, 42);
  future.Wait();
  // result = 42 * 2 + 0 = 84 (CPU handler)

  // Cleanup
  int status = 0;
  kill(-server, SIGKILL);
  waitpid(server, &status, 0);
}
```

The execl("/proc/self/exe", ...) pattern avoids inheriting the parent's CHIMAERA_INIT static guard state.

See context-runtime/modules/MOD_NAME/test/test_gpu_client_process.cc for a complete implementation of this pattern with 5 test cases covering CPU routing, GPU queue attachment, GPU routing, and cross-process memory registration.

Memory Allocation Architecture

GPU Queue Memory (Server Only)

During ServerInitGpuQueues(), the server allocates pinned host memory backends for GPU communication. Each GPU device gets:

| Backend | ID | Type | Size | Purpose |
| --- | --- | --- | --- | --- |
| CPU→GPU queue backend | 1000 + gpu_id | GpuShmMmap (pinned) | 64 MB | Holds TaskQueue structures (cpu2gpu, gpu2cpu) and FutureShm allocations for CPU→GPU tasks |
| GPU→GPU queue backend | 3000 + gpu_id | Device memory | 32 MB | Holds gpu2gpu_queue and per-thread ArenaAllocator storage for GPU→GPU tasks |
| Orchestrator scratch | 2000 + gpu_id | GpuShmMmap (pinned) | 64 MB | Per-block ArenaAllocator storage for the persistent orchestrator kernel |
| GPU heap | 9000 + gpu_id | GpuMalloc (device) | 64 MB | Per-thread BuddyAllocator (CHI_GPU_HEAP_T) for serialization scratch buffers |

GpuShmMmap backends use cudaHostAlloc (pinned host memory) accessible from both CPU and GPU via UVA. GpuMalloc backends use cudaMalloc (device memory) for GPU-only allocations.

Server Process Memory Layout (per GPU):

```text
+----------------------------------------------------------------+
| GpuShmMmap (backend ID 1000+gpu_id, 64 MB pinned host memory)  |
| URL: /chi_gpu_queue_{gpu_id}                                   |
|                                                                |
|   +-----------+   +-----------+                                |
|   | TaskQueue |   | TaskQueue |                                |
|   | gpu2cpu_q |   | cpu2gpu_q |                                |
|   | (GPU->CPU)|   | (CPU->GPU)|                                |
|   +-----------+   +-----------+                                |
|                                                                |
|   +-----------------------------+                              |
|   | FutureShm allocations       |  (allocated by SendToGpu)    |
|   | [FutureShm + copy_space]    |                              |
|   +-----------------------------+                              |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| Device memory (backend ID 3000+gpu_id, 32 MB)                  |
| URL: /chi_gpu2gpu_queue_{gpu_id}                               |
|                                                                |
|   +-----------+                                                |
|   | TaskQueue |                                                |
|   | gpu2gpu_q |                                                |
|   | (GPU->GPU)|                                                |
|   +-----------+                                                |
|                                                                |
|   +--------------------------------------------+               |
|   | ArenaAllocator per-thread (GPU->GPU tasks) |               |
|   +--------------------------------------------+               |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| GpuShmMmap (backend ID 2000+gpu_id, 64 MB pinned host memory)  |
| URL: /chi_gpu_orchestrator_{gpu_id}                            |
|                                                                |
|   +--------------------------------------------+               |
|   | Block 0: ArenaAllocator (per-thread bump)  |               |
|   +--------------------------------------------+               |
|   | Block 1: ArenaAllocator                    |               |
|   +--------------------------------------------+               |
|   | ...                                        |               |
|   +--------------------------------------------+               |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
| GpuMalloc (backend ID 9000+gpu_id, 64 MB device memory)        |
| URL: /chi_gpu_heap_{gpu_id}                                    |
|                                                                |
|   +--------------------------------------------+               |
|   | Block 0: BuddyAllocator (CHI_GPU_HEAP_T)   |               |
|   +--------------------------------------------+               |
|   | Block 1: BuddyAllocator                    |               |
|   +--------------------------------------------+               |
|   | ...                                        |               |
|   +--------------------------------------------+               |
+----------------------------------------------------------------+
```

Client Process Memory

A pure client (CHIMAERA_INIT(kClient, false)) does not create GPU backends, but attaches to the server's GPU queue backends during ClientInitGpuQueues() (if compiled with GPU support). Tasks can be sent through:

  • SHM mode: FutureShm allocated in the server's shared memory segment (chi_main_segment). The client writes serialized task data into FutureShm::copy_space via ShmTransport::Send().
  • TCP/IPC mode: FutureShm allocated via HSHM_MALLOC (process-local heap). Task data serialized and sent through ZMQ DEALER socket.
  • Direct GPU queue push (GPU-compiled clients): SendToGpu() serializes into the attached GpuShmMmap queue backend and pushes to cpu2gpu_queue. The server's GPU orchestrator dequeues and processes the task.

Client Process (GPU-compiled) Memory Layout:

```text
Attached from server (read/write via pinned host memory):
+----------------------------------------------------------------+
| GpuShmMmap (backend 1000+gpu_id) — attached via shm_attach()   |
| cpu2gpu_queue, gpu2gpu_queue pointers reconstructed            |
| FutureShm allocations for SendToGpu live here                  |
+----------------------------------------------------------------+

Optionally registered with server:
+----------------------------------------------------------------+
| GpuMalloc (client-owned, backend pid:unique_id)                |
| Device memory allocated via cudaMalloc                         |
| IPC handle sent to server via RegisterMemory(kGpuDeviceMemory) |
| Server opens handle to resolve ShmPtrs from this backend       |
+----------------------------------------------------------------+
```

GPU Kernel Memory (User-Allocated)

GPU kernels that use the client API require two memory backends passed in IpcManagerGpuInfo:

Primary backend (backend): GpuShmMmap pinned host memory (or GpuMalloc device memory for GPU→GPU). Provides per-thread ArenaAllocator (HSHM_DEFAULT_ALLOC_GPU_T) for fast bump-pointer allocation of FutureShm and task objects.

Heap backend (gpu_heap_backend): GpuMalloc device memory. Provides per-thread BuddyAllocator (CHI_GPU_HEAP_T) for serialization scratch buffers in LocalSaveTaskArchive / LocalLoadTaskArchive. Unlike the arena, the buddy allocator supports individual free(), so scratch memory is reclaimed after each task without exhausting the primary arena.

Use CHI_IPC->GetClientGpuInfo(gpu_id) to build the IpcManagerGpuInfo for same-process kernel launches — it fills all fields automatically:

// Same-process launch (Chimaera running in-process):
chi::IpcManagerGpuInfo gpu_info = CHI_IPC->GetClientGpuInfo(0);
// gpu_info.backend, .gpu2gpu_queue, .gpu_heap_backend all set

// Register primary backend so CPU can resolve GPU ShmPtrs
CHI_IPC->RegisterGpuAllocator(gpu_info.backend.id_,
gpu_info.backend.data_,
gpu_info.backend.data_capacity_);

The CHIMAERA_GPU_INIT macro partitions both backends per-thread:

```text
backend (ArenaAllocator, HSHM_DEFAULT_ALLOC_GPU_T):
+----------------------------------------------------+
| Pointer table (num_threads * sizeof(ptr))          |
+----------------------------------------------------+
| ArenaAllocator 0 (per_thread_size bytes)           |
+----------------------------------------------------+
| ArenaAllocator 1 (per_thread_size bytes)           |
+----------------------------------------------------+
| ...                                                |
+----------------------------------------------------+

gpu_heap_backend (BuddyAllocator, CHI_GPU_HEAP_T):
+----------------------------------------------------+
| Pointer table (num_threads * sizeof(ptr))          |
+----------------------------------------------------+
| BuddyAllocator 0 (per_thread_size bytes)           |
+----------------------------------------------------+
| BuddyAllocator 1 (per_thread_size bytes)           |
+----------------------------------------------------+
| ...                                                |
+----------------------------------------------------+
```

The ArenaAllocator is a bump-pointer allocator; it is reset (via alloc->Reset()) after each task completes — safe because only one task is in-flight per thread at a time. The BuddyAllocator reclaims memory individually as serialization vectors go out of scope. Access the per-thread heap via CHI_GPU_HEAP (expands to CHI_IPC->GetGpuHeap()).
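The bump-pointer discipline is simple enough to sketch in full. The `Arena` class below is an illustrative stand-in for the per-thread ArenaAllocator described above (not the hshm implementation): O(1) aligned allocation, no per-object free, and a whole-arena `Reset()` once the single in-flight task completes.

```cpp
#include <cassert>
#include <cstddef>

// Minimal bump-pointer arena; illustrative, not the hshm ArenaAllocator.
class Arena {
 public:
  Arena(void* buf, std::size_t capacity)
      : base_(static_cast<char*>(buf)), capacity_(capacity), off_(0) {}

  // Bump the offset to the next aligned position; no per-object metadata.
  void* Alloc(std::size_t size, std::size_t align = 8) {
    std::size_t start = (off_ + align - 1) & ~(align - 1);
    if (start + size > capacity_) return nullptr;  // arena exhausted
    off_ = start + size;
    return base_ + start;
  }

  // Safe because only one task is in flight per thread at a time.
  void Reset() { off_ = 0; }
  std::size_t Used() const { return off_; }

 private:
  char* base_;
  std::size_t capacity_;
  std::size_t off_;
};
```

The contrast with the buddy heap is exactly the one the text draws: the arena can only reclaim everything at once, so per-object serialization scratch belongs on the buddy side.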

GPU Initialization Sequence

Server GPU Init (Full Sequence)

The server initializes GPU resources in three phases during ServerInit():

Phase 1: Queue creation (IpcManager::ServerInitGpuQueues())

  1. Query hshm::GpuApi::GetDeviceCount() for number of GPU devices
  2. For each GPU, create a 64 MB GpuShmMmap pinned host backend (ID 1000+gpu_id) for cpu2gpu and gpu2cpu queues plus CPU→GPU FutureShm allocations
  3. Create a device-memory backend (ID 3000+gpu_id) for the gpu2gpu_queue and its per-thread ArenaAllocator
  4. Create two TaskQueue objects in the pinned backend:
    • gpu2cpu_queues_[gpu_id] — GPU→CPU queue (CPU worker polls this)
    • cpu2gpu_queues_[gpu_id] — CPU→GPU queue (orchestrator polls this)
  5. Create one TaskQueue in device memory:
    • gpu2gpu_queues_[gpu_id] — GPU→GPU queue (orchestrator polls this)
  6. Each queue has 1 lane, 2 priorities (normal + resumed), configurable depth
  7. Create orchestrator scratch backends (64 MB GpuShmMmap, IDs 2000+gpu_id)
  8. Create GPU heap backends (64 MB GpuMalloc, IDs 9000+gpu_id) for CHI_GPU_HEAP_T = BuddyAllocator
  9. Register all backends with RegisterGpuAllocator() for host-side ShmPtr resolution
  10. Populate gpu_orchestrator_info_ struct for GPU 0 including gpu_heap_backend

Phase 2: Pool composition (deferred)

  • All compose-section pools are created before the orchestrator launches
  • GPU containers are allocated during pool creation via autogenerated allocation kernels

Phase 3: Orchestrator launch (IpcManager::LaunchGpuOrchestrator())

  1. Read gpu_blocks and gpu_threads_per_block from configuration
  2. Set CUDA stack size to 131072 bytes (cudaDeviceSetLimit)
  3. Allocate WorkOrchestratorControl in pinned host memory
  4. Allocate gpu::PoolManager on device
  5. Create dedicated CUDA stream
  6. Launch persistent kernel: chimaera_gpu_orchestrator<<<blocks, threads_per_block, 0, stream>>>()
  7. Only block 0, thread 0 runs the gpu::Worker poll loop; other threads spin-wait for exit

Why deferred? The orchestrator occupies all SMs with persistent thread blocks. If launched before pool composition, cudaMalloc calls during GPU container allocation would deadlock against the persistent kernel.

Client GPU Init (None)

ClientInit() performs no GPU initialization:

  • No ServerInitGpuQueues() → cpu2gpu_queues_ remains empty
  • No LaunchGpuOrchestrator() → gpu_orchestrator_ remains nullptr
  • No GPU memory backends allocated

The client can only submit tasks through CPU transport (SHM, TCP, IPC).

Queue Assignment Algorithm

Worker Thread Partitioning

The DefaultScheduler::DivideWorkers() method partitions the configured worker threads into four roles:

Given N total worker threads (from num_threads in chimaera.yaml):

```text
Worker 0        -> Scheduler worker (handles task routing)
Workers 1..N-3  -> I/O workers (execute tasks)
Worker N-2      -> GPU worker (polls GPU->CPU lanes)
Worker N-1      -> Network worker (handles ZMQ send/recv)
```

Example with num_threads: 4:

  • Worker 0: scheduler
  • Worker 1: I/O worker (the only worker in the 1..N-3 range)
  • Worker 2: GPU worker (N-2)
  • Worker 3: network worker (N-1)

The GPU worker is assigned only when total_workers > 2. With 2 or fewer workers, there is no dedicated GPU worker and GPU→CPU tasks cannot be processed.
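The partitioning rule above is pure index arithmetic, so it can be captured in a few lines. `WorkerRole` is a hypothetical helper written for illustration, not the DivideWorkers() implementation:

```cpp
#include <cassert>
#include <string>

// Role assignment per the partitioning rule above: worker 0 schedules,
// workers 1..N-3 do I/O, worker N-2 polls GPU lanes (only when N > 2),
// and worker N-1 handles the network. Illustrative helper, not Chimaera code.
inline std::string WorkerRole(unsigned id, unsigned n) {
  if (id == 0) return "scheduler";
  if (n > 2 && id == n - 2) return "gpu";
  if (id == n - 1) return "network";
  return "io";
}
```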

GPU Lane Assignment

After ServerInitGpuQueues() creates the per-GPU TaskQueue objects, AssignGpuLanesToWorker() connects them to the GPU worker:

```cpp
void IpcManager::AssignGpuLanesToWorker() {
  Worker *gpu_worker = scheduler_->GetGpuWorker();
  std::vector<TaskLane*> gpu_lanes;

  // Collect lane 0 from each GPU's gpu_queue (GPU->CPU)
  for (size_t gpu_id = 0; gpu_id < num_gpus; ++gpu_id) {
    TaskLane *gpu_lane = &gpu_queues_[gpu_id]->GetLane(0, 0);
    gpu_lane->SetAssignedWorkerId(gpu_worker->GetId());
    gpu_lanes.push_back(gpu_lane);
  }

  gpu_worker->SetGpuLanes(gpu_lanes);
}
```

Result: The single GPU worker polls all GPU→CPU lanes across all GPUs. The GPU work orchestrator (persistent kernel) polls the CPU→GPU and GPU→GPU queues independently.

Queue Routing Summary

```text
                     Pure Client Process                Server Process
                    +------------------+             +------------------+
                    | No GPU queues    |             | cpu2gpu_queues_[]|----> GPU Orchestrator
                    | No orchestrator  |             | gpu2gpu_queues_[]|----> GPU Orchestrator
AsyncGpuSubmit() -->| SendShm() or     |--SHM/TCP/-->| gpu_queues_[]    |----> GPU Worker (CPU)
(PoolQuery::Local)  | SendZmq()        |     IPC     |                  |
                    +------------------+             +------------------+
                                                              |
                                                      CPU Workers execute
                                                      the task's CPU handler
```

Client-to-Server Transport Modes

The pure client's transport mode is determined by the CHI_IPC_MODE environment variable or chimaera.yaml configuration:

| Mode | Transport | How FutureShm is Allocated | How Task Data is Sent |
| --- | --- | --- | --- |
| SHM | Shared memory | In server's SHM segment | Written to FutureShm::copy_space ring buffer |
| TCP | ZMQ DEALER/ROUTER | Process-local heap (HSHM_MALLOC) | Serialized through ZMQ TCP socket |
| IPC | ZMQ DEALER/ROUTER (Unix socket) | Process-local heap (HSHM_MALLOC) | Serialized through ZMQ IPC socket |

For SHM mode, the server worker deserializes from the FutureShm::copy_space ring buffer using ShmTransport::Recv(). For TCP/IPC mode, the server receives the serialized data through the ZMQ ROUTER socket and deserializes on the network worker thread.

Prerequisites

CMake flags

Enable GPU support when configuring the build:

```bash
# CUDA
cmake .. -DWRP_CORE_ENABLE_CUDA=ON

# ROCm
cmake .. -DWRP_CORE_ENABLE_ROCM=ON
```

These set the HSHM_ENABLE_CUDA or HSHM_ENABLE_ROCM preprocessor macros, which gate all GPU code paths.

Required headers

```cpp
#include <chimaera/chimaera.h>
#include <chimaera/singletons.h>
#include <chimaera/task.h>
#include <chimaera/pool_query.h>
#include <hermes_shm/util/gpu_api.h>

// Your module's client and task headers
#include <chimaera/my_module/my_module_client.h>
#include <chimaera/my_module/my_module_tasks.h>
```

Writing a GPU-Compatible Task

A task that can be created and submitted from a GPU kernel must:

  1. Mark constructors and serialization methods with HSHM_CROSS_FUN
  2. Provide SerializeIn and SerializeOut methods
  3. Use only GPU-safe types (no std::string, std::vector, etc.)

```cpp
struct MyGpuTask : public chi::Task {
  IN chi::u32 input_value_;
  INOUT chi::u32 result_;

  HSHM_CROSS_FUN MyGpuTask() : chi::Task(), input_value_(0), result_(0) {}

  HSHM_CROSS_FUN explicit MyGpuTask(
      const chi::TaskId& task_id,
      const chi::PoolId& pool_id,
      const chi::PoolQuery& query,
      chi::u32 input_value)
      : chi::Task(task_id, pool_id, query, kMethodId),
        input_value_(input_value), result_(0) {
    task_id_ = task_id;
    pool_id_ = pool_id;
    method_ = kMethodId;
    task_flags_.Clear();
    pool_query_ = query;
  }

  // Serialization: fields sent GPU -> CPU
  template <typename Archive>
  HSHM_CROSS_FUN void SerializeIn(Archive& ar) {
    Task::SerializeIn(ar);
    ar(input_value_, result_);
  }

  // Serialization: fields sent CPU -> GPU (after execution)
  template <typename Archive>
  HSHM_CROSS_FUN void SerializeOut(Archive& ar) {
    Task::SerializeOut(ar);
    ar(result_);
  }
};
```

Important: Every function called during serialization must be HSHM_CROSS_FUN (i.e., __host__ __device__). Without this annotation, NVCC's SFINAE-based dispatch silently skips serialization on the GPU, producing corrupted data with no compiler error.
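The SerializeIn/SerializeOut split can be exercised on the host with a toy byte-stream archive. `OutAr`, `InAr`, and `GpuTaskFields` below are illustrative names written for this sketch, not hshm types; on the device the same pattern is compiled once the methods carry HSHM_CROSS_FUN.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy writing archive: operator()(...) appends each field's bytes in order.
struct OutAr {
  std::vector<std::uint8_t> buf;
  template <typename... Ts> void operator()(const Ts&... vs) { (write(vs), ...); }
  template <typename T> void write(const T& v) {
    auto* p = reinterpret_cast<const std::uint8_t*>(&v);
    buf.insert(buf.end(), p, p + sizeof(T));
  }
};

// Toy reading archive: operator()(...) consumes the same fields in order.
struct InAr {
  const std::uint8_t* p;
  template <typename... Ts> void operator()(Ts&... vs) { (read(vs), ...); }
  template <typename T> void read(T& v) {
    std::memcpy(&v, p, sizeof(T));
    p += sizeof(T);
  }
};

// Mirrors the split above: all fields go in on submission, only outputs
// come back after execution.
struct GpuTaskFields {
  std::uint32_t input = 0, result = 0;
  template <typename Ar> void SerializeIn(Ar& ar) { ar(input, result); }
  template <typename Ar> void SerializeOut(Ar& ar) { ar(result); }
};
```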

Writing a GPU-Compatible Client

The client class needs HSHM_CROSS_FUN on the constructor and any methods called from GPU code:

```cpp
class Client : public chi::ContainerClient {
 public:
  HSHM_CROSS_FUN Client() = default;
  HSHM_CROSS_FUN explicit Client(const chi::PoolId& pool_id) { Init(pool_id); }

  // Host-only methods (AsyncCreate, etc.) stay as-is...

  // GPU-callable task submission
  HSHM_CROSS_FUN
  chi::Future<MyGpuTask> AsyncMyTask(const chi::PoolQuery& query,
                                     chi::u32 input_value) {
    auto* ipc = CHI_IPC;
    auto task = ipc->NewTask<MyGpuTask>(
        chi::CreateTaskId(), pool_id_, query, input_value);
    return ipc->Send(task);
  }
};
```

CHI_IPC->Send() automatically dispatches based on compilation context and routing mode:

  • GPU device code (HSHM_IS_GPU): routes via SendGpu or SendGpuLocal
  • CPU host code with HSHM_ENABLE_CUDA=1: detects LocalGpuBcast/ToLocalGpu and routes via SendToGpu
  • CPU host code without GPU: routes through normal CPU routing (pool query → worker)

GPU Routing Modes

GPU → GPU (Local)

A GPU kernel submits a task that is processed by the GPU work orchestrator. The task runs on the GPU itself.

```cpp
// Inside a __global__ kernel
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::Local(), input);
future.Wait();
chi::u32 result = future->result_;
```

The GPU orchestrator must be running (not paused). The task is dispatched to your module's GpuRuntime::MyMethod() handler.

CPU → GPU (LocalGpuBcast)

CPU host code submits a task to be processed by the GPU work orchestrator. The task runs on the GPU.

```cpp
// Host code -- must be compiled with HSHM_ENABLE_CUDA=1
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
future.Wait();  // blocks until GPU completes + deserializes output
chi::u32 result = future->result_;
```

How it works:

  1. Send() detects LocalGpuBcast and calls SendToGpu() instead of normal CPU routing
  2. SendToGpu() allocates a FutureShm in GPU-accessible pinned host memory, serializes task input into its ring buffer, and pushes to the cpu2gpu_queue
  3. The GPU orchestrator pops the task, deserializes, dispatches to GpuRuntime, serializes output, and sets FUTURE_COMPLETE
  4. Wait() spins on FUTURE_COMPLETE, then deserializes output from the FutureShm ring buffer back into the task
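Steps 3 and 4 form a flag-based completion handshake: publish the output, then set the flag with release semantics; the waiter spins with acquire semantics before reading. The sketch below models it with std::atomic and a host thread standing in for the orchestrator; `FutureSlot` and `SubmitAndWait` are illustrative names, not Chimaera's FutureShm.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Models FUTURE_COMPLETE as an atomic flag set after the output is published.
struct FutureSlot {
  std::uint32_t output = 0;
  std::atomic<bool> complete{false};  // stands in for FUTURE_COMPLETE
};

inline std::uint32_t SubmitAndWait(std::uint32_t input) {
  FutureSlot slot;
  // "Orchestrator" side: compute, publish output, then set the flag.
  std::thread worker([&] {
    slot.output = input * 2;
    slot.complete.store(true, std::memory_order_release);
  });
  // "Wait()" side: spin until the flag is visible, then read the output.
  while (!slot.complete.load(std::memory_order_acquire)) {}
  worker.join();
  return slot.output;
}
```

On the GPU path the release side additionally needs __threadfence_system() so the pinned-memory write is visible across the PCIe boundary before the flag flips.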

Important -- HSHM_ENABLE_CUDA must be 1: The Send() GPU routing interceptor is inside #if HSHM_ENABLE_CUDA || HSHM_ENABLE_ROCM. If your source file is compiled with HSHM_ENABLE_CUDA=0 (e.g., a plain C++ file in a mixed CUDA/C++ target), the interceptor is compiled out and the task will go through normal CPU routing, which does not complete the future correctly for GPU tasks.

Solution: Call AsyncMyTask(LocalGpuBcast(), ...) from a source file compiled as CUDA (.cu or with LANGUAGE CUDA property), or from a wrapper function defined in such a file.

CPU → Specific GPU (ToLocalGpu)

Same as LocalGpuBcast but targets a specific GPU by device ID:

```cpp
// Send to GPU device 1
auto future = client.AsyncMyTask(chi::PoolQuery::ToLocalGpu(1), input);
future.Wait();
```

GPU → CPU (ToLocalCpu)

A GPU kernel submits a task to be processed by a CPU worker thread. The task runs on the CPU.

```cpp
// Inside a __global__ kernel
chimaera::my_module::Client client(pool_id);
auto future = client.AsyncMyTask(chi::PoolQuery::ToLocalCpu(), input);
future.Wait();
chi::u32 result = future->result_;
```

How it works:

  1. Send() on the GPU calls SendGpu(), which serializes the task and pushes to the gpu2cpu_queue
  2. The CPU GPU-worker thread polls this queue, deserializes, and routes to the module's CPU runtime handler
  3. The CPU worker serializes output into the FutureShm ring buffer and sets FUTURE_COMPLETE
  4. Wait() on the GPU calls RecvGpu() which reads from the output ring buffer

Host setup required: The GPU kernel needs a gpu2cpu_queue registered with the runtime. See Host-Side Setup for GPU → CPU below.

GPU Work Orchestrator

The GPU work orchestrator is a persistent CUDA kernel that processes tasks on the GPU. It runs on a dedicated CUDA stream and polls two queues:

  • cpu2gpu_queue -- tasks pushed by CPU via SendToGpu()
  • gpu2gpu_queue -- tasks pushed by GPU kernels via SendGpuLocal()

The runtime launches the orchestrator automatically during ServerInit() when GPU support is enabled.

Persistent Kernel Structure

The orchestrator is launched with configurable blocks and threads per block. Only block 0, thread 0 runs the worker poll loop; all other threads spin-wait for the exit signal:

```cpp
__global__ void chimaera_gpu_orchestrator(gpu::PoolManager *pool_mgr,
                                          gpu::WorkOrchestratorControl *control,
                                          IpcManagerGpuInfo gpu_info,
                                          u32 num_blocks) {
  CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks);

  if (blockIdx.x == 0 && threadIdx.x == 0) {
    control->running_flag = 1;
    gpu::Worker worker;
    worker.Init(0, gpu_info.cpu2gpu_queue, gpu_info.gpu2gpu_queue,
                pool_mgr, gpu_info.cpu2gpu_queue_base);

    while (!control->exit_flag) {
      worker.PollOnce();
    }
    worker.Finalize();
  }

  // Other blocks/threads: wait for exit signal
  if (blockIdx.x != 0 || threadIdx.x != 0) {
    while (!control->exit_flag) { /* spin */ }
  }
}
```

The extra blocks are launched to occupy all SMs, preventing other kernels from preempting the orchestrator.
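The control flow of the persistent loop can be reproduced on the host. In the sketch below, `Control` and `RunUntilExit` are illustrative stand-ins for `WorkOrchestratorControl` and `gpu::Worker`; to keep the sketch single-threaded, exit is modeled as a sentinel task, whereas the real orchestrator sees `control->exit_flag` flipped by the host (e.g., during PauseGpuOrchestrator()).

```cpp
#include <atomic>
#include <cassert>
#include <deque>

// Stand-in for WorkOrchestratorControl.
struct Control {
  std::atomic<bool> exit_flag{false};
};

// Persistent poll loop: drain the queue one task at a time until told
// to exit. A negative task value is the sentinel used by this sketch.
inline int RunUntilExit(Control& ctl, std::deque<int>& queue) {
  int processed = 0;
  while (!ctl.exit_flag.load()) {   // loop body mirrors worker.PollOnce()
    if (queue.empty()) continue;    // nothing to do this iteration
    int task = queue.front();
    queue.pop_front();
    if (task < 0) {
      ctl.exit_flag.store(true);    // sentinel stands in for host-set exit_flag
    } else {
      ++processed;                  // "PollOnce": handle one task
    }
  }
  return processed;
}
```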

GPU Container (GpuRuntime)

Each module that supports GPU execution defines a GpuRuntime class:

// my_module_gpu_runtime.h
class GpuRuntime : public chi::gpu::Container {
public:
HSHM_GPU_FUN GpuRuntime() = default;
HSHM_GPU_FUN ~GpuRuntime() override = default;

/** GPU handler for MyMethod */
HSHM_GPU_FUN void MyMethod(hipc::FullPtr<MyGpuTask> task,
chi::gpu::GpuRunContext &rctx) {
task->result_ = task->input_value_ * 3;
}

// Autogenerated virtual method overrides (switch-case dispatch)
#include "autogen/my_module_gpu_lib_exec.h"
};

Module YAML Configuration

Enable GPU support in your module's chimaera_mod.yaml:

```yaml
module_name: my_module
namespace: chimaera

# ... method IDs ...
kMyMethod: 25

# GPU support
has_gpu: true
gpu_methods:
  - kMyMethod
```

Run chimaera repo refresh . to regenerate the autogen dispatch code:

  • Per-module: autogen/my_module_gpu_lib_exec.h (virtual method overrides)
  • Repo-level: src/autogen/gpu_work_orchestrator_modules.h (container allocation + registry)

Pause/Resume

The GPU orchestrator occupies all SMs with persistent thread blocks. To launch other GPU kernels (e.g., test kernels), you must pause the orchestrator first:

```cpp
CHI_IPC->PauseGpuOrchestrator();   // signal exit + synchronize stream
// ... launch your kernel on a separate stream ...
CHI_IPC->ResumeGpuOrchestrator();  // relaunch persistent kernel
```

Important: Use stream-based synchronization (hshm::GpuApi::Synchronize(stream)) for your own kernels, not cudaDeviceSynchronize(). The latter would block on the orchestrator's stream.

Container Registration

GPU containers are automatically allocated and registered when a pool with has_gpu: true is created. The allocation kernel runs in the orchestrator's CUDA module context so vtables are correct for virtual dispatch.

GPU Kernel Implementation

Initialization Macros

Three macros are available depending on the kernel's role:

CHIMAERA_GPU_INIT(gpu_info)

For regular client kernels (not the orchestrator). Initializes the per-block IpcManager from a fully-populated IpcManagerGpuInfo. All threads in a block share one IpcManager instance (in __shared__ memory); thread 0 runs ClientInitGpu, then __syncthreads() makes it visible to all threads.

```cpp
__global__ void my_kernel(chi::IpcManagerGpuInfo gpu_info,
                          chi::PoolId pool_id,
                          chi::u32 input,
                          chi::u32* d_output) {
  CHIMAERA_GPU_INIT(gpu_info);

  chimaera::my_module::Client client(pool_id);
  auto future = client.AsyncMyTask(chi::PoolQuery::Local(), input);
  future.Wait();
  *d_output = future->result_;
}
```

CHI_CLIENT_GPU_INIT(gpu_info)

Alias for CHIMAERA_GPU_INIT. Use this in client-process kernels for clarity:

```cpp
__global__ void client_kernel(chi::IpcManagerGpuInfo gpu_info, ...) {
  CHI_CLIENT_GPU_INIT(gpu_info);
  // identical behavior
}
```

CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks)

For kernels that partition their memory across multiple blocks (e.g., benchmark client kernels and the runtime orchestrator itself). Before calling ClientInitGpu, this macro splits all three backends (backend, gpu2cpu_backend, gpu_heap_backend) so each block gets its own non-overlapping slice:

per_block = data_capacity / num_blocks
block N slice starts at: data_ + N * per_block

Use this when launching multiple blocks where each block submits tasks independently:

__global__ void bench_kernel(chi::IpcManagerGpuInfo gpu_info,
                             chi::PoolId pool_id,
                             chi::u32 num_blocks,
                             chi::u32 total_tasks,
                             int* d_done) {
  CHIMAERA_GPU_ORCHESTRATOR_INIT(gpu_info, num_blocks);

  if (threadIdx.x != 0) return;  // only thread 0 per block submits

  chimaera::my_module::Client client(pool_id);
  for (chi::u32 i = 0; i < total_tasks; ++i) {
    auto future = client.AsyncMyTask(chi::PoolQuery::Local(), i);
    future.Wait();
  }

  if (blockIdx.x == 0) {
    __threadfence_system();
    *d_done = 1;
  }
}

Host-Side Setup

Host-Side Setup for GPU → GPU

For PoolQuery::Local() from a GPU kernel, the task goes through the gpu2gpu_queue which is managed by the orchestrator. The simplest approach is GetClientGpuInfo(), which fills all IpcManagerGpuInfo fields automatically from the running runtime:

// Build IpcManagerGpuInfo — fills backend, gpu2gpu_queue, gpu_heap_backend, etc.
chi::IpcManagerGpuInfo gpu_info = CHI_IPC->GetClientGpuInfo(0);

// Register primary backend so CPU can resolve GPU ShmPtrs
CHI_IPC->RegisterGpuAllocator(gpu_info.backend.id_,
                              gpu_info.backend.data_,
                              gpu_info.backend.data_capacity_);

For custom backends (e.g., benchmark kernels with a dedicated allocation region):

// Custom primary backend (pinned host, for ArenaAllocator)
hipc::MemoryBackendId backend_id(100, 0);
hipc::GpuShmMmap gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/gpu_test", 0);
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
                              gpu_backend.data_capacity_);

// Custom heap backend (device memory, for BuddyAllocator / CHI_GPU_HEAP_T)
hipc::MemoryBackendId heap_id(101, 0);
hipc::GpuMalloc gpu_heap;
gpu_heap.shm_init(heap_id, 4 * 1024 * 1024, "/gpu_heap", 0);

chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = gpu_backend;
gpu_info.gpu_heap_backend = gpu_heap;
gpu_info.gpu2cpu_queue = nullptr;  // not using GPU->CPU
gpu_info.cpu2gpu_queue = nullptr;  // not receiving CPU->GPU
gpu_info.gpu2gpu_queue = CHI_IPC->GetGpuToGpuQueue(0);

Use pinned host memory for result polling (not cudaDeviceSynchronize, which hangs with the persistent orchestrator):

int *d_result;
cudaMallocHost(&d_result, sizeof(int));
*d_result = 0;

// Pause orchestrator, launch kernel, resume orchestrator
CHI_IPC->PauseGpuOrchestrator();
my_kernel<<<1, 1, 0, stream>>>(gpu_info, pool_id, input, d_result);
CHI_IPC->ResumeGpuOrchestrator();

// Poll pinned memory for completion
while (*d_result == 0) {
  std::this_thread::sleep_for(std::chrono::microseconds(100));
}

Host-Side Setup for GPU → CPU

For PoolQuery::ToLocalCpu() from a GPU kernel, the task goes through the gpu2cpu_queue that a CPU worker thread polls. The host must create and register this queue, and also provide a pinned-host gpu2cpu_backend so the GPU kernel can allocate FutureShm objects:

// 1. Primary backend (device memory, for ArenaAllocator)
hipc::MemoryBackendId backend_id(5, 0);
hipc::GpuMalloc gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/gpu_to_cpu", 0);

// 2. GPU->CPU copy-space backend (pinned host, for FutureShm allocation)
hipc::MemoryBackendId copy_backend_id(6, 0);
hipc::GpuShmMmap copy_backend;
copy_backend.shm_init(copy_backend_id, 4 * 1024 * 1024, "/gpu_to_cpu_copy", 0);
CHI_IPC->RegisterGpuAllocator(copy_backend_id, copy_backend.data_,
                              copy_backend.data_capacity_);

// 3. GPU task queue (pinned shared memory)
hipc::MemoryBackendId queue_backend_id(7, 0);
hipc::GpuShmMmap queue_backend;
queue_backend.shm_init(queue_backend_id, 2 * 1024 * 1024, "/gpu_to_cpu_q", 0);

auto *queue_alloc = queue_backend.MakeAlloc<hipc::ArenaAllocator<false>>(
    queue_backend.data_capacity_);
auto gpu_queue_ptr = queue_alloc->NewObj<chi::TaskQueue>(
    queue_alloc, 1, 2, 1024);

// 4. Register queue and assign to CPU GPU-worker
CHI_IPC->RegisterGpuQueue(gpu_queue_ptr);
CHI_IPC->AssignGpuLanesToWorker();
CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
                              gpu_backend.data_capacity_);

// 5. Build IpcManagerGpuInfo and launch kernel
chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = gpu_backend;
gpu_info.gpu2cpu_queue = gpu_queue_ptr.ptr_;
gpu_info.gpu2cpu_backend = copy_backend;

CPU → GPU (No Extra Host Setup)

For PoolQuery::LocalGpuBcast() from CPU host code, no additional host setup is needed beyond creating the pool. The runtime creates the cpu2gpu_queue and GPU orchestrator during ServerInit().

// Compile this in a CUDA source file (HSHM_ENABLE_CUDA=1)
chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, true);

chimaera::my_module::Client client(pool_id);
auto create = client.AsyncCreate(chi::PoolQuery::Dynamic(), name, pool_id);
create.Wait();

// Submit task to GPU -- Send() detects LocalGpuBcast -> SendToGpu()
auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
future.Wait(); // spins on FUTURE_COMPLETE, deserializes GPU output
chi::u32 result = future->result_;

Key Registration Functions

FunctionPurpose
CHI_IPC->GetClientGpuInfo(gpu_id)Build a fully-populated IpcManagerGpuInfo for same-process kernel launches
CHI_IPC->RegisterGpuQueue(queue)Registers a GPU task queue so the runtime can poll it
CHI_IPC->AssignGpuLanesToWorker()Assigns all registered GPU queue lanes to the designated GPU worker thread
CHI_IPC->RegisterGpuAllocator(id, data, capacity)Registers GPU backend memory so the host can resolve ShmPtrs allocated by the GPU kernel
CHI_IPC->PauseGpuOrchestrator()Pauses the persistent GPU orchestrator to free SMs
CHI_IPC->ResumeGpuOrchestrator()Resumes the GPU orchestrator
CHI_IPC->GetGpuToGpuQueue(gpu_id)Returns the orchestrator's GPU→GPU queue pointer for a given device
CHI_GPU_HEAPPer-thread BuddyAllocator (CHI_GPU_HEAP_T) from the GPU heap table; valid in device code after CHIMAERA_GPU_INIT

Configuration Parameters

GPU Stack Size

GPU kernels that use the client API involve deep template instantiation chains (serialization, ring buffer, allocator). The default CUDA stack size (1024 bytes) is insufficient.

cudaDeviceSetLimit(cudaLimitStackSize, 131072);  // 128 KB recommended

Symptoms of stack overflow: silent kernel completion with corrupted or missing data (no error reported by cudaGetLastError).

Ring Buffer Size (Copy Space)

The SPSC ring buffer size determines how much serialized data can be in-flight between GPU and CPU. This is set per-task:

// Inside your task class, override GetCopySpaceSize:
HSHM_CROSS_FUN size_t GetCopySpaceSize() const { return 8192; }

The default is 4096 bytes if GetCopySpaceSize() returns 0. The FutureShm is allocated as sizeof(FutureShm) + copy_space_size, with separate input_ and output_ ring buffers sharing the same copy space.

For tasks with large serialized payloads, increase this value. If the ring buffer is too small, the producer will spin-wait for the consumer to drain data.

GPU Memory Backend Sizes

Two backends must be sized independently:

Primary backend (backend, ArenaAllocator): holds FutureShm + task objects. Reset after each task so size only needs to cover one task at a time:

// Per-block: sizeof(FutureShm) + copy_space + sizeof(TaskT) + overhead
// 10 MB per block is sufficient for typical tasks
size_t backend_size = num_blocks * 10 * 1024 * 1024;

Heap backend (gpu_heap_backend, BuddyAllocator): holds serialization scratch vectors. Individual free() reclaims memory, so size covers peak concurrent scratch usage (~2× copy_space per task):

// Per-block: ~2 * copy_space_size + BuddyAllocator overhead
// 4 MB per block is sufficient for typical tasks
size_t heap_backend_size = num_blocks * 4 * 1024 * 1024;

A single task submission typically consumes from the primary arena:

  • sizeof(FutureShm) + copy_space_size for the future
  • sizeof(TaskT) for the deserialized task object (orchestrator side)

And from the heap:

  • Input serialization vector (~copy_space_size)
  • Output serialization vector (~copy_space_size)

GPU Task Queue Configuration

// Parameters: allocator, num_lanes, num_priorities, depth
queue_alloc->NewObj<chi::TaskQueue>(queue_alloc, 1, 2, 1024);
ParameterDefaultDescription
num_lanes1Number of queue lanes (1 per GPU is typical)
num_priorities2Priority levels per lane
depth1024Maximum tasks per lane before blocking

Queue Backend Size

The queue backend holds the TaskQueue data structure and its internal lane storage:

size_t queue_memory_size = 2 * 1024 * 1024;  // 2 MB is sufficient for most use cases

Runtime Thread Configuration

The Chimaera runtime automatically detects GPU devices and creates GPU queues during ServerInit. The default scheduler (DefaultScheduler) partitions workers as follows:

# chimaera.yaml
num_threads: 4

With num_threads: N:

Worker IDRoleResponsibility
0SchedulerTask routing and scheduling
1 to N-3I/O workersExecute tasks from CPU queues
N-2GPU workerPolls GPU→CPU queue lanes
N-1Network workerHandles ZMQ send/recv

The GPU worker polls all GPU→CPU lanes across all GPU devices. With num_threads: 4, the partition yields 1 scheduler worker, 1 I/O worker, 1 GPU worker, and 1 network worker.

CMake Integration

Mixed CUDA/C++ Targets

When building a test or application that has both CUDA and C++ source files, keep GPU-routing calls in the CUDA source file. The typical CMake pattern:

# GPU kernels compiled as CUDA
set_source_files_properties(my_gpu_kernels.cc PROPERTIES LANGUAGE CUDA)
add_library(my_gpu_kernels OBJECT my_gpu_kernels.cc)
target_link_libraries(my_gpu_kernels PRIVATE hshm::cuda_cxx)

# Main executable with CPU sources
add_executable(my_app my_main.cc $<TARGET_OBJECTS:my_gpu_kernels>)

# CRITICAL: CPU sources are compiled with HSHM_ENABLE_CUDA=0 to avoid
# __device__ errors. GPU routing in Send() is compiled out.
set_source_files_properties(my_main.cc PROPERTIES
  COMPILE_OPTIONS "-UHSHM_ENABLE_CUDA;-DHSHM_ENABLE_CUDA=0"
)

Consequence: client.AsyncMyTask(LocalGpuBcast(), ...) called from my_main.cc will NOT route to the GPU because the Send() interceptor is compiled out. Instead, define a wrapper in the CUDA source file:

// my_gpu_kernels.cc (compiled as CUDA, HSHM_ENABLE_CUDA=1)
extern "C" int run_cpu_to_gpu(chi::PoolId pool_id, chi::u32 input,
                              chi::u32 *out_result) {
  chimaera::my_module::Client client(pool_id);
  auto future = client.AsyncMyTask(chi::PoolQuery::LocalGpuBcast(), input);
  if (!future.Wait(10.0f)) return -3;  // timeout
  *out_result = future->result_;
  return 1;
}

GPU Module Include Directories

GPU targets need $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}> for autogen headers:

target_include_directories(my_gpu_target PRIVATE
  $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}>
)

Data Flow Details

Send Path (GPU → CPU)

  1. NewTask: Allocates the task from the per-thread ArenaAllocator
  2. SendGpu: Allocates a FutureShm with embedded ring buffer copy space
  3. Enqueue: Pushes the Future onto the GPU task queue lane (the CPU worker polls this)
  4. Serialize + Send: Builds a LocalSaveTaskArchive, calls task->SerializeIn(ar), then streams the archive through ShmTransport::Send into the input ring buffer

Receive Path (CPU → GPU)

  1. CPU worker dequeues the FutureShm, reads from the input ring buffer via ShmTransport::Recv, deserializes, executes the task, then writes output back through the output ring buffer
  2. RecvGpu: On the GPU, Future::Wait() calls RecvGpu which reads from the output ring buffer via ShmTransport::Recv
  3. Deserialize: A LocalLoadTaskArchive deserializes the output fields via task->SerializeOut(ar)
  4. Completion: The GPU spins on FUTURE_COMPLETE flag (should already be set after Recv completes)

CPU → GPU Send Path (SendToGpu)

  1. Send() detects LocalGpuBcast/ToLocalGpu routing and calls SendToGpu()
  2. SendToGpu allocates a FutureShm in GPU-accessible pinned host memory via AllocateGpuBuffer()
  3. Serializes task input via task->SerializeIn(ar) + ShmTransport::Send into the input ring buffer
  4. Pushes a Future<Task> onto cpu2gpu_queues_[gpu_id]
  5. Returns a Future<TaskT> with the pinned-memory FutureShm

CPU → GPU Receive Path (Wait runtime path)

  1. Wait() enters the runtime spin-wait loop on FUTURE_COMPLETE
  2. The GPU orchestrator pops the task, dispatches to GpuRuntime, serializes output into the FutureShm's output ring buffer, calls __threadfence_system(), and sets FUTURE_COMPLETE
  3. Wait() detects FUTURE_COPY_FROM_CLIENT flag and non-zero output_.copy_space_size_, then deserializes output via ShmTransport::Recv + task->SerializeOut(ar)

Client Process → Server CPU Path (SendShm/SendZmq)

When a pure client uses PoolQuery::Local(), the task bypasses all GPU queues and goes through the standard client transport:

  1. SendShm: Allocates FutureShm in server's SHM segment with copy_space ring buffer. Enqueues to worker_queues_ lane (mapped by ClientMapTask). Serializes task via SaveTaskArchive + ShmTransport::Send into the ring buffer.
  2. Server worker: Dequeues FutureShm, checks FUTURE_COPY_FROM_CLIENT flag, reads from copy_space ring buffer via ShmTransport::Recv, deserializes, executes the CPU handler (e.g., result = test_value * 2 + gpu_id).
  3. Response: Worker serializes output back through the ring buffer, sets FUTURE_COMPLETE.
  4. Client Wait(): Spins on FUTURE_COMPLETE, deserializes output from copy_space.

This path uses the CPU runtime handler, not the GPU handler. The computation differs:

  • CPU handler: result_value = test_value * 2 + gpu_id
  • GPU handler: result_value = test_value * 3 + gpu_id

Wire Format

GPU and CPU use compatible but different archive types:

SideSerializationDeserialization
GPU (send)LocalSaveTaskArchiveLocalLoadTaskArchive
CPU (worker)SaveTaskArchiveLoadTaskArchive

Both produce the same binary wire format: [vector_size][elements...][msg_type][stream_data]. This is critical -- if the formats diverge, deserialization will produce corrupted data.

Client Process GPU Tests

The test file context-runtime/modules/MOD_NAME/test/test_gpu_client_process.cc verifies cross-process GPU task submission using the fork-based test pattern. A child process runs as a standalone Chimaera server (with --server-mode), while the parent connects as a pure client.

Test Infrastructure

All tests share the same server/client lifecycle managed by EnsureInitialized():

  1. StartServerProcess(): Forks a child, sets CHI_WITH_RUNTIME=1, and execls back into the same binary with --server-mode. The child calls CHIMAERA_INIT(kServer, true) and sleeps until killed.
  2. WaitForServer(): Polls for the SHM segment file (/tmp/chimaera_<user>/chi_main_segment_<user>) up to 50 times at 200ms intervals.
  3. Parent client init: Sets CHI_WITH_RUNTIME=0 and calls CHIMAERA_INIT(kClient, false). During ClientConnect, the client receives GPU queue metadata from the server.
  4. Cleanup: CleanupServer() sends SIGKILL to the server process group and removes SHM files.

Test Cases

client_process_gpu_submit_local

Submits a GpuSubmitTask with PoolQuery::Local() routing from the client process. The task travels via SHM or TCP transport to the server's CPU worker, which executes the CPU handler (result = test_value * 2 + gpu_id). Verifies the result matches the expected CPU computation.

This test works regardless of GPU support — Local() routing always goes through the CPU worker.

client_process_gpu_submit_multiple

Submits 5 sequential GpuSubmitTasks with different test_value parameters (100–104), each via PoolQuery::Local(). Verifies that all return correct results, confirming the transport handles repeated task round-trips without state corruption.

client_process_gpu_queue_attachment

Verifies that ClientInitGpuQueues() successfully attached to the server's GPU queues during ClientConnect. Checks that GetToGpuQueueCount() > 0 and that GetToGpuQueue(0) and GetGpuToGpuQueue(0) return non-null pointers.

Requires HSHM_ENABLE_CUDA or HSHM_ENABLE_ROCM at compile time. Without GPU support, prints "GPU support not compiled, skipping" and passes.

client_process_gpu_submit_to_gpu

Submits a GpuSubmitTask with PoolQuery::LocalGpuBcast() routing. This exercises the SendToGpu() path: the client serializes the task into the attached GpuShmMmap queue backend and pushes it onto cpu2gpu_queue. The server's GPU orchestrator dequeues and dispatches to the module's GPU handler.

Requires GPU support compiled in and a GPU device on the server. Verifies that result = test_value * 2 + gpu_id (GPU handler uses the same formula in this test module).

client_process_register_gpu_memory

Tests cross-process GPU memory registration. The client:

  1. Allocates a 4 MB GpuMalloc backend (device memory via cudaMalloc)
  2. Reads the IPC handle from the backend's private header (GPU→host memcpy)
  3. Sends a RegisterMemoryTask with MemoryType::kGpuDeviceMemory to the server
  4. The server opens the IPC handle via GpuMalloc::shm_attach_ipc() and registers the allocator in gpu_alloc_map_

After registration, the server can resolve ShmPtrs pointing into this client-owned device memory.

Building and Running

cd /workspace/build
cmake /workspace && cmake --build . -j$(nproc)

# CRITICAL: install ALL .so files (stale modules cause serialization mismatches)
sudo cp bin/lib*.so /usr/local/lib/ && sudo ldconfig

# Run all client process GPU tests
./bin/chimaera_gpu_client_process_tests

# Run a specific test
./bin/chimaera_gpu_client_process_tests client_process_gpu_submit_local

Note: Tests 3–5 (gpu_queue_attachment, gpu_submit_to_gpu, register_gpu_memory) require the test binary to be compiled with CUDA/ROCm support. Without it, they skip gracefully. Tests 1–2 work with any build configuration.

Performance

GPU→GPU Task Round-Trip Latency

The following latency was measured using the in-process benchmark (bench_gpu_runtime) with 1 client block × 1 active thread (sequential AsyncGpuSubmit + Wait() loop):

Allocator (CHI_GPU_HEAP_T)Avg LatencyThroughput (1 thread)Notes
BuddyAllocator (default)~200 µs~5,000 tasks/secManaged pool; per-allocation free; no exhaustion
MallocAllocator (device heap)~400 µs~2,500 tasks/secCUDA device malloc()/free(); serialized across warps

BuddyAllocator backed by GpuMalloc device memory is the default because it is 2× faster than CUDA device heap malloc() and does not exhaust memory across arbitrarily many tasks.

How Latency Is Measured

The benchmark timer starts on the CPU immediately after ResumeGpuOrchestrator() and ends when the CPU polls *d_done == 1 (set by the GPU after the last task completes). The formula is:

avg_latency = elapsed_ms * 1000 / total_tasks   (µs per task)

This captures the full GPU→GPU round-trip per task:

  1. AsyncGpuSubmit: serialize inputs into FutureShm ring buffer, push to gpu2gpu_queue
  2. Orchestrator polls, dispatches DispatchTask: deserialize → Run() → serialize output
  3. FUTURE_COMPLETE written with __threadfence_system() + system-scope atomic
  4. future.Wait() spin exits, RecvGpu deserializes output

Note: the first task's queue-push overhead is excluded because the GPU kernel starts before the CPU timer. CPU polling overhead (100 µs sleep per iteration) adds at most ~0.1 µs per task across 1,000 tasks and is negligible.

Memory Limits

The default GpuShmMmap orchestrator scratch (HSHM_DEFAULT_ALLOC_GPU_T = ArenaAllocator) is reset after each completed task (alloc->Reset()), so it never exhausts regardless of task count. The BuddyAllocator heap (CHI_GPU_HEAP_T) reclaims memory individually as serialization vectors go out of scope.

Troubleshooting

CUDA Error 700 (Illegal Memory Access)

Usually caused by GPU stack overflow during deep serialization template chains. Increase the stack size:

cudaDeviceSetLimit(cudaLimitStackSize, 131072);

CUDA Error 715 (Illegal Instruction)

Caused by using system-scope atomics on pinned host memory, or by cross-library virtual function calls. Solutions:

  • Use device-scope atomics (atomicAdd, atomicOr) with explicit __threadfence_system() at strategic points
  • Ensure all GPU containers are allocated within the orchestrator's CUDA module context (the autogen system handles this)

Silent data corruption (kernel completes but wrong values)

Check that all serialization functions are marked HSHM_CROSS_FUN. Without __device__ annotation, NVCC's SFINAE dispatch silently skips serialization, consuming 0 bytes and misaligning the stream. The compiler produces no error.

Ring buffer deadlock (kernel hangs)

The GPU producer and CPU consumer share the ring buffer. If the buffer fills, the producer spins waiting for the consumer. Ensure:

  • The CPU worker is running and polling the GPU queue lanes
  • RegisterGpuQueue and AssignGpuLanesToWorker were called before the kernel launch
  • The copy space is large enough for the serialized payload

CPU→GPU Wait() timeout

If future.Wait() times out after SendToGpu():

  • Verify the GPU orchestrator is running (not paused)
  • Verify the module's GPU container is registered (check for "Registered GPU container" log message)
  • Verify the call site is compiled with HSHM_ENABLE_CUDA=1 -- if compiled with HSHM_ENABLE_CUDA=0, the Send() GPU interceptor is absent and the task goes through normal CPU routing, which cannot complete a GPU-bound future

Client process GPU routing fails

If a pure client process (CHIMAERA_INIT(kClient, false)) tries to use LocalGpuBcast or ToLocalGpu and gets empty futures:

  • Binary not compiled with GPU support: ClientInitGpuQueues() is skipped, cpu2gpu_queues_ is empty. Use PoolQuery::Local() as a fallback (task goes to the server's CPU handler).
  • Stale installed libraries: After rebuilding with GPU queue extensions, all .so files must be reinstalled — not just libchimaera_cxx.so. Stale module .so files (e.g., libchimaera_admin_runtime.so) cause ClientConnect serialization mismatches because the server loads modules from /usr/local/lib/. Fix: sudo cp /workspace/build/bin/lib*.so /usr/local/lib/ && sudo ldconfig
  • Server has no GPUs: ClientConnect returns num_gpus_ == 0, so no queues are attached. This is expected behavior.

Kernel launch failure (-201)

Check cudaGetLastError() after the kernel launch. Common causes:

  • Too many threads per block (ArenaAllocator table won't fit in memory)
  • Missing CHIMAERA_GPU_INIT (uninitialized shared memory access)

ShmPtr resolution failure on host

The CPU worker must be able to resolve ShmPtrs that point into GPU-allocated pinned memory. Call RegisterGpuAllocator before launching the kernel:

CHI_IPC->RegisterGpuAllocator(backend_id, gpu_backend.data_,
                              gpu_backend.data_capacity_);

GPU orchestrator SM starvation

The persistent orchestrator occupies all SMs (32 blocks by default). To launch other GPU kernels:

  1. Call CHI_IPC->PauseGpuOrchestrator()
  2. Launch your kernel on a separate stream
  3. Synchronize your stream (not cudaDeviceSynchronize)
  4. Call CHI_IPC->ResumeGpuOrchestrator()