GPU CTE API

This guide describes how to call AsyncPutBlob, AsyncGetBlob, and AsyncGetOrCreateTag from CUDA GPU kernels using the CTE (Context Transfer Engine) client API. Two usage patterns are covered:

  1. CPU-side with fork client -- the simpler path where a CPU test process launches a background Chimaera runtime and issues CTE calls from host code using shared-memory buffers.
  2. GPU kernel-side -- the advanced path where a CUDA __global__ kernel allocates tasks, submits them through the GPU-to-CPU queue, and waits for completion entirely on-device.

Headers

#include <chimaera/chimaera.h>
#include <chimaera/pool_query.h>
#include <chimaera/singletons.h>
#include <wrp_cte/core/core_client.h>
#include <wrp_cte/core/core_tasks.h>

// GPU memory backends (GPU kernel path only)
#include <hermes_shm/memory/backend/gpu_shm_mmap.h>
#include <hermes_shm/memory/backend/gpu_malloc.h>
#include <hermes_shm/util/gpu_api.h>

CPU-Side Usage (Fork Client)

This is the recommended starting point. A fork client spawns a background Chimaera runtime in a child process, then the parent issues CTE calls from normal host code. Blob data is allocated in shared memory so the runtime workers can access it directly.

1. Initialize Chimaera

bool ok = chi::CHIMAERA_INIT(chi::ChimaeraMode::kClient, /*fork=*/true);

Passing true as the second argument forks a background runtime in a child process. Allow roughly 500 ms for the child to start before issuing tasks.

2. Create the CTE Pool

chi::PoolId core_pool_id = wrp_cte::core::kCtePoolId;
wrp_cte::core::Client core_client(core_pool_id);

wrp_cte::core::CreateParams params;
auto create_task = core_client.AsyncCreate(
    chi::PoolQuery::Dynamic(),
    wrp_cte::core::kCtePoolName, core_pool_id, params);
create_task.Wait();

3. Register a Storage Target

CTE needs at least one storage backend for blob data.

auto reg_task = core_client.AsyncRegisterTarget(
    "/tmp/cte_gpu_test.dat",          // path on disk
    chimaera::bdev::BdevType::kFile,  // file-backed target
    16 * 1024 * 1024,                 // 16 MB capacity
    chi::PoolQuery::Local(),
    chi::PoolId(700, 0));             // target pool ID
reg_task.Wait();

4. Create a Tag

Tags group blobs logically. The returned tag_id_ is required by PutBlob and GetBlob.

auto tag_task = core_client.AsyncGetOrCreateTag("my_gpu_tag");
tag_task.Wait();
wrp_cte::core::TagId tag_id = tag_task->tag_id_;

5. AsyncPutBlob

Allocate a shared-memory buffer, fill it, then submit a PutBlob task.

const size_t blob_size = 4096;

// Allocate in shared memory (accessible to runtime workers)
hipc::FullPtr<char> buf = CHI_IPC->AllocateBuffer(blob_size);
std::memset(buf.ptr_, 0xAB, blob_size);

// Cast to a void ShmPtr for the CTE API
hipc::ShmPtr<> blob_data = buf.shm_.Cast<void>();

auto put = core_client.AsyncPutBlob(
    tag_id,
    "my_blob",        // blob name
    /*offset=*/0,
    /*size=*/blob_size,
    blob_data,
    /*score=*/-1.0f,  // -1 = auto-place
    wrp_cte::core::Context(),
    /*flags=*/0,
    chi::PoolQuery::Local());
put.Wait();
assert(put->GetReturnCode() == 0);

6. AsyncGetBlob

Allocate an output buffer in shared memory and submit a GetBlob task.

hipc::FullPtr<char> out_buf = CHI_IPC->AllocateBuffer(blob_size);
std::memset(out_buf.ptr_, 0, blob_size);
hipc::ShmPtr<> out_data = out_buf.shm_.Cast<void>();

auto get = core_client.AsyncGetBlob(
    tag_id,
    "my_blob",
    /*offset=*/0,
    /*size=*/blob_size,
    /*flags=*/0,
    out_data,
    chi::PoolQuery::Local());
get.Wait();
assert(get->GetReturnCode() == 0);
// out_buf.ptr_ now contains the retrieved blob data

GPU Kernel-Side Usage

For submitting CTE tasks directly from a CUDA __global__ kernel, additional setup is required. The GPU kernel uses CHIMAERA_GPU_INIT to set up per-thread allocators, then calls CHI_IPC->NewTask<T>() and CHI_IPC->Send() exactly like host code.

Memory Backend Setup (Host Side)

Three GPU memory backends must be initialized before launching the kernel:

// 1. Primary backend -- task object allocation (GpuShmMmap = UVM)
hipc::MemoryBackendId backend_id(20, 0);
hipc::GpuShmMmap gpu_backend;
gpu_backend.shm_init(backend_id, 10 * 1024 * 1024, "/my_gpu_cte", 0);

// Register with the IPC manager so the GPU kernel can allocate from it
CHI_IPC->RegisterGpuAllocator(
    backend_id, gpu_backend.data_, gpu_backend.data_capacity_);

// 2. GPU-to-CPU backend -- FutureShm lives here (UVM, CPU+GPU visible)
hipc::MemoryBackendId g2c_backend_id(21, 0);
hipc::GpuShmMmap g2c_backend;
g2c_backend.shm_init(g2c_backend_id, 4 * 1024 * 1024, "/my_gpu_g2c", 0);

// 3. GPU heap backend -- device memory for serialization scratch
hipc::MemoryBackendId heap_backend_id(22, 0);
hipc::GpuMalloc gpu_heap_backend;
gpu_heap_backend.shm_init(heap_backend_id, 4 * 1024 * 1024, "/my_gpu_heap", 0);

IpcManagerGpuInfo Setup (Host Side)

Pack the backends and queue pointer into a struct that the kernel receives by value:

chi::IpcManagerGpuInfo gpu_info;
gpu_info.backend = static_cast<hipc::MemoryBackend &>(gpu_backend);
gpu_info.gpu2cpu_queue = CHI_IPC->GetGpuQueue(0);  // pre-existing GPU-to-CPU queue
gpu_info.gpu2cpu_backend = static_cast<hipc::MemoryBackend &>(g2c_backend);
gpu_info.gpu_heap_backend = static_cast<hipc::MemoryBackend &>(gpu_heap_backend);

Kernel Launch (Host Side)

Use a non-blocking CUDA stream to avoid serializing with the default stream (important when a persistent GPU orchestrator kernel is also running):

cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

my_cte_kernel<<<1, 1, 0, stream>>>(gpu_info, pool_id, tag_id);

// Poll or synchronize as needed (see polling pattern below)

GPU Kernel Implementation

__global__ void my_cte_kernel(chi::IpcManagerGpuInfo gpu_info,
                              chi::PoolId pool_id,
                              wrp_cte::core::TagId tag_id) {
  // Initialize per-thread GPU allocators and IPC context
  CHIMAERA_GPU_INIT(gpu_info);

  // ---- GetOrCreateTag ----
  auto tag_task = CHI_IPC->NewTask<wrp_cte::core::GetOrCreateTagTask<>>(
      chi::CreateTaskId(),
      pool_id,
      chi::PoolQuery::ToLocalCpu(),  // route to CPU worker
      "my_gpu_tag",
      wrp_cte::core::TagId::GetNull());
  auto tag_future = CHI_IPC->Send(tag_task);
  tag_future.Wait();
  wrp_cte::core::TagId result_tag_id = tag_future->tag_id_;

  // ---- PutBlob ----
  const size_t blob_size = 4096;
  hipc::FullPtr<char> buf = CHI_IPC->AllocateBuffer(blob_size);
  // Fill buffer on device...

  auto put_task = CHI_IPC->NewTask<wrp_cte::core::PutBlobTask>(
      chi::CreateTaskId(),
      pool_id,
      chi::PoolQuery::ToLocalCpu(),
      result_tag_id,
      "my_blob",
      /*offset=*/0ULL,
      /*size=*/(chi::u64)blob_size,
      buf.shm_.Cast<void>(),
      /*score=*/-1.0f,
      wrp_cte::core::Context(),
      /*flags=*/0U);
  auto put_future = CHI_IPC->Send(put_task);
  put_future.Wait();

  // ---- GetBlob ----
  hipc::FullPtr<char> out = CHI_IPC->AllocateBuffer(blob_size);
  auto get_task = CHI_IPC->NewTask<wrp_cte::core::GetBlobTask>(
      chi::CreateTaskId(),
      pool_id,
      chi::PoolQuery::ToLocalCpu(),
      result_tag_id,
      "my_blob",
      /*offset=*/0ULL,
      /*size=*/(chi::u64)blob_size,
      /*flags=*/0U,
      out.shm_.Cast<void>());
  auto get_future = CHI_IPC->Send(get_task);
  get_future.Wait();
  // out.ptr_ now contains the retrieved data
}

Routing from GPU Kernels

PoolQuery                    Direction    Description
PoolQuery::ToLocalCpu()      GPU to CPU   Task enters the gpu2cpu_queue and is processed by a CPU worker. Use this for CTE blob operations.
PoolQuery::Local()           GPU to GPU   Task stays on the GPU orchestrator. Use for GPU-native module methods.
PoolQuery::LocalGpuBcast()   CPU to GPU   CPU pushes a task to the GPU orchestrator. Used from host code, not inside kernels.

For CTE blob I/O, always use PoolQuery::ToLocalCpu() because the CTE runtime (storage targets, metadata) runs on CPU workers.

GPU Wait with Stop Flag

future.Wait() spins on the GPU until the CPU worker marks the future complete. For robustness, use a manual poll loop with a CPU-controlled stop flag so the host can terminate a stuck kernel:

__global__ void safe_kernel(chi::IpcManagerGpuInfo gpu_info,
                            /* ... */,
                            volatile int *d_stop) {
  CHIMAERA_GPU_INIT(gpu_info);
  // ... NewTask + Send ...
  auto fshm_full = future.GetFutureShm();
  chi::FutureShm *fshm = fshm_full.ptr_;
  while (fshm && !fshm->flags_.AnySystem(chi::FutureShm::FUTURE_COMPLETE)) {
    // Bypass the GPU L2 cache to see the CPU-written stop flag
    int stop = atomicAdd_system(const_cast<int *>(d_stop), 0);
    if (stop) return;
    HSHM_THREAD_MODEL->Yield();
  }
}

On the host side, allocate d_stop with cudaMallocManaged and set it to 1 if the kernel times out.
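The host-side watchdog logic can be sketched as a plain C++ loop. In this sketch, std::atomic variables stand in for the cudaMallocManaged stop flag and the kernel's completion signal (WaitOrStop, stop_flag, and done are hypothetical names introduced here purely to illustrate the timeout pattern, not part of the Chimaera API):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Stand-in for the cudaMallocManaged stop flag shared with the kernel.
std::atomic<int> stop_flag{0};

// Wait up to `timeout` for `done` to become true; on timeout, raise the
// stop flag so the kernel's poll loop can exit. Returns true if the work
// completed in time, false if we timed out and signaled a stop.
bool WaitOrStop(std::atomic<bool> &done, std::chrono::milliseconds timeout) {
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done.load()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      stop_flag.store(1);  // the kernel observes this via atomicAdd_system
      return false;        // timed out
    }
    std::this_thread::yield();
  }
  return true;  // completed in time
}
```

In the real setup, `done` would be replaced by a check such as cudaStreamQuery on the kernel's stream, and `stop_flag` by the managed d_stop allocation passed to safe_kernel.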

Context Structure

wrp_cte::core::Context controls compression and placement behavior. Pass a default-constructed Context() for standard uncompressed I/O.

Key fields:

Field                    Default  Description
dynamic_compress_        0        0 = skip, 1 = static, 2 = dynamic compression
compress_lib_            0        Compression library index (0-10)
compress_preset_         2        1 = FAST, 2 = BALANCED, 3 = BEST
min_persistence_level_   0        0 = volatile, 1 = temp-nonvolatile, 2 = long-term
score (PutBlob param)    -1.0     -1 = auto-place, 0.0-1.0 = explicit tier (higher = faster)
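As a sketch, a Context requesting static compression and long-term persistence might be configured as below. The field names come from the table above; the specific values are illustrative assumptions, not a verified configuration:

```cpp
wrp_cte::core::Context ctx;     // defaults: no compression, volatile
ctx.dynamic_compress_ = 1;      // 1 = static compression
ctx.compress_lib_ = 0;          // compression library index (0-10)
ctx.compress_preset_ = 2;       // 2 = BALANCED
ctx.min_persistence_level_ = 2; // 2 = long-term persistence

// Pass ctx in place of the default-constructed Context(), e.g. as the
// Context argument of AsyncPutBlob shown earlier.
```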

CMake Integration

GPU CTE kernels must be compiled with CUDA (*.cu files or set_source_files_properties(... LANGUAGE CUDA)). Link against:

target_link_libraries(my_target
  wrp_cte_core_client  # CTE client library
  chimaera_cxx         # Chimaera runtime
  hermes_shm_host      # Shared memory primitives
)

Enable CUDA in the project:

enable_language(CUDA)
set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_ARCHITECTURES 70 80 90)