Creating a Lakehouse Service: Part 2: Creating K8s Resources

Data · DuckDB · K8s

This is part two of a longer series and here is the link to the previous part.

sitrep

Ok, so after part one, we have:

  1. an init script that can be used to “pre-configure” our duckdb session
  2. a working k8s pod configuration
    • a custom duckdb-based image that runs our init script and launches UI
    • an nginx sidecar that actually exposes the duckdb UI
  3. a configured traefik IngressRoute to access the nginx exposed port

But this is only for one pod. We want this to be “scalable”, so that many users can each have their own pod/session while staying connected to the same Iceberg tables.

The API

As alluded to in the prologue/background post to this series, the ultimate goal here is to have some form of command or UI element/button that actually launches the DuckDB pod/session. I chose to have an always-on API pod running in the k8s cluster. This API would be reachable by anyone on the network (it has its own IngressRoute) and would have some pretty simple endpoints to spin up or tear down these DuckDB pods and related resources.

Making an Image

The actual API portion of this (I’m using FastAPI) is fairly simple. We need two endpoints:

  • POST to create the resources
  • DELETE to delete the resources

This was my first time playing with the Kubernetes Python client, and I really enjoyed using it. The kubernetes.config.load_incluster_config() function is especially nice: it picks up the pod’s in-cluster credentials automatically and makes everything else easy to use.
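In practice, the setup at the top of the module is only a few lines. Roughly (the namespace defaults here are just placeholders):

import datetime
import os
import uuid

from fastapi import FastAPI, HTTPException
from kubernetes import client, config

# Inside the cluster, this picks up the pod's ServiceAccount credentials automatically.
config.load_incluster_config()

core_v1 = client.CoreV1Api()            # Pods and Services
custom_api = client.CustomObjectsApi()  # Traefik IngressRoute CRDs

app = FastAPI()

# Namespaces come from environment variables; the defaults are illustrative only.
DUCKUI_NAMESPACE = os.getenv("DUCKUI_NAMESPACE", "duckui")
TLS_SECRET_NAMESPACE = os.getenv("TLS_SECRET_NAMESPACE", "default")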

I made separate functions for building each resource, which helped me see how the YAML specs from the previous part map onto the Python API. Here’s the build_pod function as an example (surrounding configuration, like the target namespace, is populated from environment variables via os.getenv()).

def build_pod(
    session_id: str, username: str, catalog: str, s3_key: str, s3_secret: str
):
    labels = {
        "app": "duckui",
        "session-id": session_id,
        "duckui-user": username,
        "duckui-session": "true",
    }
    annotations = {
        "duckui.bsale.me/created_at": datetime.datetime.utcnow().isoformat() + "Z",
    }
    env = [
        client.V1EnvVar(name="SESSION_ID", value=session_id),
        client.V1EnvVar(name="USERNAME", value=username),
        client.V1EnvVar(name="S3_ACCESS_KEY_ID", value=s3_key),
        client.V1EnvVar(name="S3_SECRET_ACCESS_KEY", value=s3_secret),
        client.V1EnvVar(name="CATALOG_ENDPOINT", value=catalog),
    ]

    container = client.V1Container(
        name="duckui",
        image="bsale/duckui:latest",
        stdin=True,  # keep stdin open so the DuckDB session stays alive
        env=env,
        ports=[client.V1ContainerPort(container_port=4213)],  # DuckDB UI port
    )

    proxy_volume_mount = client.V1VolumeMount(
        name="nginx-conf",
        mount_path="/etc/nginx/conf.d",
    )

    proxy_volume = client.V1Volume(
        name="nginx-conf",
        config_map=client.V1ConfigMapVolumeSource(name="duckui-nginx-conf"),
    )

    # nginx sidecar that actually exposes the DuckDB UI (see part one)
    proxy = client.V1Container(
        name="duckui-proxy",
        image="nginx:alpine",
        ports=[client.V1ContainerPort(container_port=8080)],
        volume_mounts=[proxy_volume_mount],
    )

    pod_spec = client.V1PodSpec(containers=[container, proxy], volumes=[proxy_volume])
    metadata = client.V1ObjectMeta(
        name=f"duckui-{session_id}",
        namespace=DUCKUI_NAMESPACE,
        labels=labels,
        annotations=annotations,
    )
    return client.V1Pod(api_version="v1", kind="Pod", metadata=metadata, spec=pod_spec)
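build_service and build_ingressroute follow the same pattern: the Service selects the session’s pod via its labels and exposes the nginx port, while the IngressRoute mirrors the YAML from part one. As a rough sketch of build_service (the Service name and port choices here are just conventions, not gospel):

def build_service(session_id: str) -> client.V1Service:
    # Match the per-session pod by its labels and expose the nginx sidecar's port.
    selector = {"app": "duckui", "session-id": session_id}
    return client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=client.V1ObjectMeta(
            name=f"duckui-{session_id}",
            namespace=DUCKUI_NAMESPACE,
            labels=selector,
        ),
        spec=client.V1ServiceSpec(
            selector=selector,
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )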

With the different build_* functions defined, I created the actual FastAPI endpoint. I used Pydantic models for the request and response, though this likely didn’t matter too much since they are pretty simple anyway (just a bunch of strings).
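They look roughly like this:

from pydantic import BaseModel


class SessionCreateRequest(BaseModel):
    username: str
    s3_key: str
    s3_secret: str
    catalog: str


class SessionResponse(BaseModel):
    session_id: str
    url: str

And the endpoint itself: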

@app.post("/sessions", response_model=SessionResponse)
def create_session(req: SessionCreateRequest):
    session_id = uuid.uuid4().hex[:6]  # e.g. "abc123"

    pod = build_pod(
        session_id,
        username=req.username,
        s3_key=req.s3_key,
        s3_secret=req.s3_secret,
        catalog=req.catalog,
    )
    svc = build_service(session_id)
    ingress_body, host = build_ingressroute(session_id)

    try:
        # 1) pod
        core_v1.create_namespaced_pod(namespace=DUCKUI_NAMESPACE, body=pod)
        # 2) service
        core_v1.create_namespaced_service(namespace=DUCKUI_NAMESPACE, body=svc)
        # 3) ingressroute CRD
        custom_api.create_namespaced_custom_object(
            group="traefik.io",
            version="v1alpha1",
            namespace=TLS_SECRET_NAMESPACE,
            plural="ingressroutes",
            body=ingress_body,
        )
    except Exception as e:
        # attempt cleanup here if something failed mid-way
        raise HTTPException(status_code=500, detail=str(e))

    url = f"https://{host}"
    return SessionResponse(session_id=session_id, url=url)
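The DELETE endpoint is the mirror image. A minimal sketch, assuming the Service and IngressRoute share the duckui-<session_id> name used for the Pod above:

@app.delete("/sessions/{session_id}")
def delete_session(session_id: str):
    name = f"duckui-{session_id}"  # assumed naming convention shared by all three resources
    try:
        core_v1.delete_namespaced_pod(name=name, namespace=DUCKUI_NAMESPACE)
        core_v1.delete_namespaced_service(name=name, namespace=DUCKUI_NAMESPACE)
        custom_api.delete_namespaced_custom_object(
            group="traefik.io",
            version="v1alpha1",
            namespace=TLS_SECRET_NAMESPACE,
            plural="ingressroutes",
            name=name,
        )
    except client.exceptions.ApiException as e:
        raise HTTPException(status_code=e.status or 500, detail=str(e))
    return {"session_id": session_id, "deleted": True}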

With both endpoints in place, the Dockerfile is short (I’m using uv for package management, which is why the uv image is the base):

FROM ghcr.io/astral-sh/uv:python3.10-alpine

WORKDIR /app

COPY pyproject.toml ./

RUN uv sync

COPY app ./app

EXPOSE 8000
CMD ["uv", "run", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Fun with DNS

The API itself was very simple to build, and I was pleasantly surprised by how fast everything deployed. Since the complexities of the pod and its related resources were already ironed out in part one, this part went much more smoothly.

However, the issues I was running into were not directly related to this project! Since we are creating an IngressRoute that points specifically to the newly created pod, we need a DNS entry for the hostname we generate (which includes the randomized session ID). For development purposes, making these entries manually wasn’t too much overhead.

When I started creating and deleting pods more frequently, however, this quickly became pretty annoying. So I made a wildcard DNS entry for *.local.bsale.me on my DNS server, pointing to Traefik.

Creating and Deleting a Pod/Session

This is as easy as issuing a couple of requests.

Creating a new pod:

curl -X POST https://duckui-api.local.bsale.me/sessions \
  -H "Content-Type: application/json" \
  -d '{"username": "bsale", "s3_key": "<s3_key>", "s3_secret": "<s3_secret>", "catalog": "http://x.y.z.a:8181/catalog"}'

This returns output like:

{"session_id":"977a26","url":"https://duckui-977a26.local.bsale.me"}

And then deleting the session is as simple as

curl -X DELETE https://duckui-api.local.bsale.me/sessions/977a26

Summary

I think we have a pretty extensible base here: if more parameters are needed at some point in the future, we can easily add them to our Pydantic models and to the functions that generate the different objects in the k8s cluster.

In the future, I could see adding specific parameters/defaults for resource requests, or offering “t-shirt” size options that apply pre-defined resource presets.
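Just to illustrate the t-shirt idea, the presets could be a simple mapping that gets attached to the duckui container in build_pod (the numbers here are made up):

# Hypothetical presets; the actual values would need tuning for real workloads.
T_SHIRT_SIZES = {
    "small": client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},
        limits={"cpu": "1", "memory": "2Gi"},
    ),
    "medium": client.V1ResourceRequirements(
        requests={"cpu": "1", "memory": "4Gi"},
        limits={"cpu": "2", "memory": "8Gi"},
    ),
    "large": client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "8Gi"},
        limits={"cpu": "4", "memory": "16Gi"},
    ),
}

# In build_pod, this would become something like:
#   client.V1Container(..., resources=T_SHIRT_SIZES[size])
# where `size` is a new (hypothetical) field on the request model.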