Multi-testbed experiment#

Prerequisite: Complete Your first experiment before starting this tutorial.

This tutorial walks through running a Pegasus workflow experiment that spans FABRIC and Chameleon Edge simultaneously, with HTCondor as the deployment layer.

What you will build#

You will run a two-task Pegasus workflow distributed across execute nodes on two separate testbeds, FABRIC and Chameleon Edge, within a single HTCondor pool. One FABRIC node acts as the HTCondor central manager, the submit node, and an execute node. Chameleon Edge contributes a second execute node, and HTCondor schedules tasks across both transparently.

The workflow itself is minimal: Task A produces an output file, Task B consumes it. The point is not the computation but the infrastructure — by the end you will have provisioned nodes on two different testbeds, configured them into a single HTCondor pool, submitted a Pegasus workflow, and collected results, all from one config file and three commands.

Prerequisites#

Before you begin, confirm all of the following:

  1. You have completed Your first experiment (Tutorial 1) and are comfortable with the basic kiso up / run / down workflow.

  2. You have a FABRIC account with an active project allocation. See Set up on FABRIC.

  3. You have a Chameleon account with an active project allocation. See Set up on Chameleon Edge.

  4. Public IP addresses are available on both testbeds for the HTCondor submit and central manager nodes. This is a hard requirement for multi-testbed HTCondor, not a configuration preference. See Components — HTCondor for why this is necessary.

Step 1 — Configure the experiment for two testbeds#

Create experiment.yml:

name: multi-testbed-workflow

sites:
  - kind: fabric
    rc_file: secrets/fabric_rc
    walltime: "01:00:00"
    resources:
      machines:
        - labels: [submit, central-manager, execute]
          flavour: small
          number: 1
      networks:
        - labels: [internal]
          kind: FABNetv4
          site: UCSD
          nic:
            kind: SharedNIC
            model: ConnectX-6

  - kind: chameleon-edge
    rc_file: secrets/edge-app-cred-oac-edge-openrc.sh
    walltime: "01:00:00"
    lease_name: kiso-multi-testbed
    resources:
      machines:
        - labels:
            - execute
          machine_name: raspberrypi4-64
          count: 1
          container:
            name: execute
            image: rockylinux:9

deployment:
  htcondor:
    - kind: central-manager
      labels: [central-manager]
    - kind: submit
      labels: [submit]
    - kind: execute
      labels: [execute]

experiments:
  - kind: pegasus
    name: distributed-workflow
    main: ./workflow.py
    submit_node_labels: [submit]
    timeout: 3600
    outputs:
      - labels:
          - submit
        src: outputs/result-a.txt
        dst: output
      - labels:
          - submit
        src: outputs/result-b.txt
        dst: output

Notice that the execute label appears on nodes from both testbeds. HTCondor treats them as a single pool regardless of where they physically run.
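
To make the label matching concrete, here is a minimal sketch in plain Python. It is not Kiso's implementation: the dictionaries are a hand-trimmed copy of the labels in experiment.yml, the helper sites_for_role is hypothetical, and the rule "a machine gets a role if it shares at least one label with it" is an assumption made for illustration.

```python
# Illustrative model only: labels copied by hand from experiment.yml.
machines = [
    {"site": "fabric", "labels": {"submit", "central-manager", "execute"}},
    {"site": "chameleon-edge", "labels": {"execute"}},
]

deployment = [
    {"kind": "central-manager", "labels": {"central-manager"}},
    {"kind": "submit", "labels": {"submit"}},
    {"kind": "execute", "labels": {"execute"}},
]


def sites_for_role(role_labels, machines):
    """Return the sites whose machines share at least one label with the role.

    Assumption: any shared label is enough to select a machine.
    """
    return [m["site"] for m in machines if m["labels"] & role_labels]


role_to_sites = {d["kind"]: sites_for_role(d["labels"], machines) for d in deployment}

for kind, sites in role_to_sites.items():
    print(f"{kind}: {', '.join(sites)}")
```

Under these assumptions, only the execute role resolves to machines on both testbeds, which is why the pool spans FABRIC and Chameleon Edge.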

Step 2 — Write the Pegasus workflow script#

Create workflow.py alongside experiment.yml:

#!/usr/bin/env python3

from pathlib import Path
from Pegasus.api import *

wf = Workflow("multi-testbed-workflow")

# --- Sites ---

sc = SiteCatalog()

WORK_DIR = Path.cwd().resolve()

shared_scratch_dir = str(WORK_DIR / "scratch")
local_storage_dir = str(WORK_DIR / "outputs")

local = Site("local").add_directories(
    Directory(Directory.SHARED_SCRATCH, shared_scratch_dir).add_file_servers(
        FileServer("file://" + shared_scratch_dir, Operation.ALL)
    ),
    Directory(Directory.LOCAL_STORAGE, local_storage_dir).add_file_servers(
        FileServer("file://" + local_storage_dir, Operation.ALL)
    ),
)

condorpool_amd = (
    Site("condorpool_amd", arch=Arch.X86_64)
    .add_pegasus_profile(style="condor")
    .add_pegasus_profile(auxillary_local="true")
    .add_condor_profile(universe="vanilla")
)

condorpool_arm = (
    Site("condorpool_arm", arch=Arch.AARCH64)
    .add_pegasus_profile(style="condor")
    .add_pegasus_profile(auxillary_local="true")
    .add_condor_profile(universe="vanilla")
)

sc.add_sites(local, condorpool_amd, condorpool_arm)

sc.write()

# --- Transformations ---

task_a = Transformation(
    "task-a",
    site="condorpool_amd",
    pfn="/usr/bin/pegasus-keg",
    is_stageable=False,
    arch=Arch.X86_64,
    os_type=OS.LINUX,
)
task_b = Transformation(
    "task-b",
    site="condorpool_arm",
    pfn="/usr/bin/pegasus-keg",
    is_stageable=False,
    arch=Arch.AARCH64,  # Ensure it runs on Chameleon Edge
    os_type=OS.LINUX,
)

tc = TransformationCatalog().add_transformations(task_a, task_b).write()

# --- Jobs ---

# Task A: no explicit Condor requirements; its transformation targets condorpool_amd (x86_64)
result_a = File("result-a.txt")
job_a = Job(task_a)
job_a.add_args("-o", result_a)
job_a.add_outputs(result_a)

# Task B: consumes Task A's output, so it runs after Task A
result_b = File("result-b.txt")
job_b = Job(task_b)
job_b.add_args("-i", result_a, "-o", result_b)
job_b.add_inputs(result_a)
job_b.add_outputs(result_b)
# Pin Task B to an AArch64 machine, i.e. the Chameleon Edge device
job_b.add_condor_profile(requirements='TARGET.Arch == "AARCH64"')
wf.add_jobs(job_a, job_b)

wf.write("workflow.yml").plan(sites=["condorpool_amd", "condorpool_arm"], submit=True)

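This workflow has two dependent tasks: Task B reads the file that Task A writes, so it starts only after Task A finishes. Task A's transformation targets the x86_64 site, while Task B is pinned to an AArch64 machine, which in this experiment is the Chameleon Edge device. HTCondor matches each job to a suitable execute node in the shared pool.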
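
Note that workflow.py never declares the dependency explicitly. The Pegasus Python API infers it from the files each job reads and writes (the Workflow constructor's infer_dependencies option, which defaults to True). The sketch below imitates that idea with the standard library only. It is not Pegasus code, and the helper infer_dependencies here is a hypothetical stand-in.

```python
from graphlib import TopologicalSorter

# Inputs and outputs as declared for the two jobs in workflow.py.
jobs = {
    "task-a": {"inputs": set(), "outputs": {"result-a.txt"}},
    "task-b": {"inputs": {"result-a.txt"}, "outputs": {"result-b.txt"}},
}


def infer_dependencies(jobs):
    """Map each job to the set of jobs that produce one of its inputs."""
    producer = {f: name for name, job in jobs.items() for f in job["outputs"]}
    return {
        name: {producer[f] for f in job["inputs"] if f in producer}
        for name, job in jobs.items()
    }


dependencies = infer_dependencies(jobs)

# TopologicalSorter takes a mapping of node -> predecessors and
# yields every node after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())

print(dependencies)
print(order)
```

Because result-a.txt is the only link between the jobs, the single inferred edge is task-a before task-b.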

Step 3 — Run the Pegasus workflow#

Provision all resources and install HTCondor on all nodes:

kiso up experiment.yml

Kiso provisions nodes on both FABRIC and Chameleon Edge, installs HTCondor, and configures the pool so all execute nodes (from both testbeds) register with the central manager.

Run the Pegasus workflow:

kiso run experiment.yml

Kiso submits the Pegasus workflow from the HTCondor submit node. Pegasus plans the workflow and hands its jobs to HTCondor, which matches them to execute nodes. Kiso and HTCondor handle the cross-testbed routing transparently.
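
The following sketch gives a rough picture of how an architecture requirement steers a job to one testbed. It is a deliberately simplified model, not HTCondor's matchmaker: the machine names and Arch values are hypothetical, and the only ClassAd behaviour it imitates is that == compares strings without regard to case. Task A has no explicit requirement in workflow.py, so the sketch assumes it lands on an x86_64 node, which is what its transformation entry declares.

```python
# Hypothetical machine ads; names and Arch values are made up for illustration.
machines = [
    {"Name": "fabric-node", "Arch": "X86_64"},
    {"Name": "edge-device", "Arch": "aarch64"},
]


def matches(machine, required_arch):
    """Compare Arch case-insensitively, as the ClassAd == operator does for strings."""
    return machine["Arch"].lower() == required_arch.lower()


def candidates(required_arch):
    """Return the names of all machines that satisfy the Arch requirement."""
    return [m["Name"] for m in machines if matches(m, required_arch)]


task_a_hosts = candidates("X86_64")   # assumed target for Task A
task_b_hosts = candidates("AARCH64")  # requirement set on Task B in workflow.py

print("Task A candidates:", task_a_hosts)
print("Task B candidates:", task_b_hosts)
```

In this model each task has exactly one candidate, so the two tasks end up on different testbeds.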

Step 4 — Collect results and view them#

ls -l output/

# Workflow outputs
cat output/result-a.txt
cat output/result-b.txt

# Pegasus workflow statistics
cat output/distributed-workflow/instance-0/run0001/statistics/summary.txt

All results are collected into the same output directory structure. See Collect and export results for details on the output format and how to export results.
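
If you prefer to check the results from a script rather than by eye, a small helper like the one below can confirm that both files arrived. This is a sketch under the assumption that the files land directly in output/, as the commands above suggest. The function name missing_outputs is hypothetical and not part of Kiso.

```python
from pathlib import Path

# File names taken from the outputs section of experiment.yml.
EXPECTED = ("result-a.txt", "result-b.txt")


def missing_outputs(output_dir, expected=EXPECTED):
    """Return the expected file names that do not exist in output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("output")
    print("all outputs present" if not missing else f"missing: {missing}")
```

An empty list means both workflow outputs were collected.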

Tear down all resources:

kiso down experiment.yml

This destroys the resources on both FABRIC and Chameleon Edge.
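
Because both testbeds hold a lease until teardown, it can be worth making sure that kiso down runs even when an earlier step fails. The sketch below wraps the three commands from this tutorial in a try/finally. The wrapper run_experiment and its runner parameter are hypothetical, and it assumes that kiso run has already copied the outputs to the local output directory before teardown, as Step 4 suggests.

```python
import subprocess

CONFIG = "experiment.yml"


def run_experiment(config=CONFIG, runner=subprocess.run):
    """Run kiso up and kiso run, then always attempt kiso down."""
    try:
        runner(["kiso", "up", config], check=True)
        runner(["kiso", "run", config], check=True)
    finally:
        # Tear down even if provisioning or the workflow failed,
        # so that no testbed resources stay allocated by accident.
        runner(["kiso", "down", config], check=True)
```

Calling run_experiment() executes the same three commands in order. The runner parameter exists only so that the sequence can be exercised without touching a real testbed.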

What you have accomplished#

🎉 Congratulations! You have run a Pegasus workflow across two separate testbeds from a single Kiso config file. Specifically, you have:

  • ✅ Provisioned nodes on FABRIC and Chameleon Edge simultaneously with kiso up

  • ✅ Configured all those nodes into a single HTCondor pool spanning two different network domains

  • ✅ Submitted a Pegasus workflow whose tasks ran on execute nodes on both testbeds

  • ✅ Collected results into a single output directory

This is a non-trivial distributed systems achievement. 🏆 Doing it manually (provisioning two testbeds, installing and configuring HTCondor on each, establishing cross-testbed connectivity, submitting a workflow, and retrieving outputs) would require hours of careful work and deep familiarity with both testbed APIs. You did it with a config file and three commands. 🚀 That is exactly what Kiso is designed to make possible.

What’s next#