Creating a Lakehouse Service: Part 1: Deploying a UI Pod

Data · DuckDB · K8s

This is part one of a longer series.

On a personal note, I really enjoyed this part of the project because it helped me solidify some existing knowledge and learn some new things in the process.

Selecting Tools

DuckDB

I knew that I could use open table formats pretty easily with some of the more popular distributed engines such as Snowflake. But, after the Small Data 2025 conference, I was particularly motivated to see how much use I could get out of DuckDB and some of its built-in tooling, such as the new UI. Additionally, I have been really impressed with DuckDB’s integrations. Not only have they been focusing on compatibility with different open table formats, but they’ve been putting their engine in some pretty cool places.

I feel like I’ve tinkered with Postgres a good amount, and, since my main use case (college basketball analytics) tends to be on the more analytical side of the spectrum, I decided to put some hours in on an OLAP engine. So, DuckDB it was.

DuckLake vs Iceberg

While this project would technically be viable with either DuckLake or Iceberg as my open table format of choice (since DuckDB is compatible with both), I decided to use Iceberg because it has broader interoperability with other database engines.

I think if I were really trying to go the route of more closely duplicating the efforts of MotherDuck (hosted DuckDB), I would have chosen DuckLake, but I was curious to get my hands dirty with Iceberg and learn some of the finer details.

Lakekeeper

Since I was using Iceberg, I needed a catalog to host and manage the metadata for the tables. I went with Lakekeeper since I had tinkered with it before and already had a Postgres instance running to be its backend. Additionally, the Lakekeeper team offers a Helm chart, which was very useful since I am running this in k8s.

Lakekeeper also offers an easy-to-use (but simple) UI if needed, which I decided to put on a LoadBalancer IP locally just to see what information was surfaced there.
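
One way to do that (whether through the chart’s values or a standalone Service) is a LoadBalancer Service in front of Lakekeeper. Here’s a minimal sketch; the name, namespace, and selector labels are placeholders for whatever your Helm release actually uses, and 8181 is simply the port my catalog endpoint runs on:

apiVersion: v1
kind: Service
metadata:
  name: lakekeeper-ui-lb        # hypothetical name
  namespace: lakekeeper         # wherever the Helm chart is installed
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: lakekeeper   # match your release's pod labels
  ports:
    - port: 8181
      targetPort: 8181          # the port Lakekeeper serves on in my setup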

Homelab

I’ve been meaning to write more extensively about the homelab setup I have going, but consider this blog part of that effort. :)

Long story short, earlier this year I bought four Lenovo ThinkCentre Tinys and clustered them together using k3s since I was interested in learning about containers and Kubernetes more specifically.

This is where I’ll be deploying all of these things, and it’s why this blog is k8s-specific! Also, I have MinIO deployed locally in my homelab as an object store, so all references to S3 here actually refer to S3-compatible, self-hosted object storage via MinIO.

Getting to Work

Creating DuckDB Sessions

Connecting to an Iceberg catalog is easy enough with DuckDB. What I set out to do here was create a “script” that we could pass to duckdb -init <script>, running everything we needed (set the settings, attach the catalog) before launching the UI. Our container setup, then, would be pretty simple. I also wanted to make this as configurable as possible in areas where I identified that some customization/personalization would be useful, but my main guiding principle was that I didn’t want users to have to run anything; I just wanted them to be automatically connected to the lake/lakehouse (the catalog).

I was really impressed by the flexibility of DuckDB’s getenv function here, both for SETting individual session variables and for attaching the catalog.

-- install and load the extensions we need: httpfs for S3/MinIO access,
-- iceberg for the catalog, and ui for the DuckDB UI
INSTALL httpfs;
LOAD httpfs;

INSTALL iceberg;
LOAD iceberg;

INSTALL ui;
LOAD ui;

-- UI and S3 (MinIO) settings, with credentials pulled from environment variables
SET UI_LOCAL_PORT = 4213;
SET s3_region = 'us-west-2';
SET s3_access_key_id = getenv('S3_ACCESS_KEY_ID');
SET s3_secret_access_key = getenv('S3_SECRET_ACCESS_KEY');
SET s3_url_style = 'path';
SET s3_use_ssl = false;

-- attach the Iceberg catalog served by Lakekeeper
ATTACH 'db' AS ice (
    TYPE iceberg,
    ENDPOINT getenv('CATALOG_ENDPOINT'),
    authorization_type none -- should match your Lakekeeper auth settings
);

-- launch the DuckDB UI
CALL start_ui_server();

I ended up having to use SET GLOBAL instead of just SET, since launching the UI technically starts a session that is distinct from the prior commands/settings in the script above. Since deploying this, I have moved the S3 credentials to DuckDB secrets.
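
For reference, here’s a minimal sketch of what that could look like as a DuckDB S3 secret in place of the individual SET commands. The credential and endpoint values below are placeholders (not my actual config); with MinIO you’d point ENDPOINT at your own object store:

-- a sketch of an S3 secret for a MinIO-backed setup; all values are placeholders
CREATE OR REPLACE SECRET lake_s3 (
    TYPE s3,
    KEY_ID '<access key>',
    SECRET '<secret key>',
    REGION 'us-west-2',
    ENDPOINT 'minio.local:9000',  -- hypothetical MinIO endpoint
    URL_STYLE 'path',
    USE_SSL false
);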

I was able to run this locally with duckdb -init script.sql to validate that I wouldn’t need to change any additional settings or attach anything once the UI session began.

Running the UI in a Pod

Now that I had the DuckDB session prepared, I needed to actually get the “remote” session running. I did this by creating a very simple Dockerfile so I could build an image to reference within my Pod definition. I’m using a pod name with a “random” string (abc123) attached to somewhat simulate how I want the “spinning up a pod” flow to work.

FROM duckdb/duckdb:1.4.2

COPY lake.sql /etc/lake.sql

ENTRYPOINT ["/duckdb", "-init", "/etc/lake.sql"]

And the Pod definition referencing that image:

apiVersion: v1
kind: Pod
metadata:
  name: duckui-abc123
  labels:
    app: duckui
    session-id: abc123
spec:
  containers:
    - name: duckui
      image: bsale/duckui:latest
      ports:
        - containerPort: 4213
      env:
        - name: S3_ACCESS_KEY_ID
          value: <secret value> # recommend actually using a k8s secret here
        - name: S3_SECRET_ACCESS_KEY
          value: <value> # recommend actually using a k8s secret here
        - name: CATALOG_ENDPOINT
          value: "http://172.x.y.z:8181/catalog"

The advantage of explicitly setting ui_local_port in the DuckDB config is that the UI reliably starts on a known port, which makes the Pod config pretty easy to read.

Lessons learned: tty

At first, my pod deployment was repeatedly failing: it would show as running for half a second and then crash with a segmentation fault.

I found that this is because duckdb expects to be run in a terminal, and that terminal has to stay open while the UI is running. The pod was closing stdin as soon as the init script had run the commands it was supposed to, so the process died. The solution here is to set stdin and tty on the container. I don’t love this solution, since it requires force-deleting the pod if you need to re-create it, but it does mean everything works. I’ll be looking for a better approach for this part.

apiVersion: v1
kind: Pod
metadata:
  name: duckui-abc123
  labels:
    app: duckui
    session-id: abc123
spec:
  containers:
    - name: duckui
      image: bsale/duckui:latest
      stdin: true
      tty: true
      ports:
        - containerPort: 4213
      env:
        - name: S3_ACCESS_KEY_ID
          value: <secret value> # recommend actually using a k8s secret here
        - name: S3_SECRET_ACCESS_KEY
          value: <value> # recommend actually using a k8s secret here
        - name: CATALOG_ENDPOINT
          value: "http://172.x.y.z:8181/catalog"

Lessons learned: localhost vs 0.0.0.0

Everything to this point has been pretty smooth sailing, but then we enter networking land. Everyone’s favorite, right?

DuckDB does not take any config for which interface(s) the UI binds to; it only binds to localhost. There’s a GitHub issue about this (hint: the solution is discussed within that issue). I have used sidecars before, but this was my first opportunity (read: the first time it was necessary) to deploy one alongside something custom.

Simple enough: we run an nginx container alongside our custom DuckDB container and set up a proxy.

apiVersion: v1
kind: Pod
metadata:
  name: duckui-abc123
  labels:
    app: duckui
    session-id: abc123
spec:
  containers:
    - name: duckui
      image: bsale/duckui:latest
      # omitting the rest of the duckui config for brevity
    - name: ui-proxy
      image: nginx:alpine
      volumeMounts:
        - name: nginx-conf
          mountPath: /etc/nginx/conf.d
      ports:
        - name: ui
          containerPort: 8080
  volumes:
    - name: nginx-conf
      configMap:
        name: duckui-nginx-conf
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: duckui-nginx-conf
  namespace: custom
data:
  default.conf: |-
    server {
      listen 8080;

      location / {
        proxy_pass http://localhost:4213;
      }
    }

The lesson learned here: localhost is only reachable from that host itself; nothing outside of it can connect. I think I knew that from a personal computer standpoint, but I hadn’t really thought through its implications for containers and Kubernetes. Since containers in the same Pod share a network namespace, the nginx sidecar can reach the UI on localhost:4213 and “tunnel” external connections through to what we actually care about.

Lessons learned: Secure hosts

Next, I navigated to my URL/IP and found… this error:

auth0-spa-js must run on a secure origin. See https://github.com/auth0/auth0-spa-js/blob/main/FAQ.md#why-do-I-get-auth0-spa-js-must-run-on-a-secure-origin for more information.

Cool, something I recognized from that GitHub issue earlier. New errors are good! After some reading, I learned that localhost is treated as a secure origin by default. But since we were coming from outside localhost, we needed another solution. ChatGPT recommended that I simply port-forward, but that seemed like a very manual, hacky approach that wouldn’t “scale” to what I wanted from this project.

I use Traefik in my k8s cluster. I configured an IngressRoute so that I could serve the UI over HTTPS, which counts as a secure origin. This is a more repeatable, “scalable” solution, since we could create an IngressRoute straight to each pod later on in the project.

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: duckui-abc123
  namespace: traefik
  annotations:
    kubernetes.io/ingress.class: traefik-external
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`duckui-abc123.local.bsale.me`)
      kind: Rule
      services:
        - name: duckui-abc123
          kind: Service
          namespace: custom
          port: 80
  tls:
    secretName: my-tls-cert
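
One piece I haven’t shown is the Service that the IngressRoute targets. Here’s a minimal sketch of what that could look like, assuming we simply expose the nginx sidecar’s port 8080 as port 80 and select on the pod’s labels:

apiVersion: v1
kind: Service
metadata:
  name: duckui-abc123
  namespace: custom
spec:
  selector:
    app: duckui
    session-id: abc123   # route to this session's pod only
  ports:
    - name: ui
      port: 80           # the port the IngressRoute targets
      targetPort: 8080   # the nginx sidecar's containerPort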

Lessons learned: Proxy headers

Simply using HTTPS fixed the error from the last section. Now I was presented with a more obscure error:

Failed to resolve app state with user - RangeError: Offset is outside the bounds of the DataView

When I searched for this error, this GitHub issue came up, which reminded me of something I saw on that other GitHub issue from earlier.

To get a better look at what was going on, I compared the request headers from hitting my local URL https://duckui-abc123.local.bsale.me with the headers from a fresh duckdb -ui run locally. I was intrigued to find that some of the API requests were going through successfully, but two were not: a POST to /localToken and a POST to /ddb/run. Both were returning 401s, which I verified in the pod logs for the proxy.

As recommended in this comment on the earlier GitHub issue, I used proxy_set_header to set each of Host, Origin, and Referer to localhost:4213 within the proxy config.

apiVersion: v1
kind: ConfigMap
metadata:
  name: duckui-nginx-conf
  namespace: custom
data:
  default.conf: |-
    server {
      listen 8080;

      location / {
        proxy_pass http://localhost:4213;
        proxy_set_header Host localhost:4213;
        proxy_set_header Origin http://localhost:4213;
        proxy_set_header Referer http://localhost:4213;
      }
    }

aaaaand this worked!

Verifying Catalog Connection

Now that I had a fully functioning DuckDB UI running within the cluster, I decided to see whether my script had correctly attached everything for use.

The first query that I ran was select * from duckdb_settings(); to see which settings had and hadn’t been applied. I noticed that the S3 settings from my script were not set. As I alluded to earlier, this was a matter of changing the SET commands to SET GLOBAL. Once I did that, these settings showed up as expected.
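
Concretely, the S3 lines in the init script become the following (same values as before, just set globally so the UI’s separate session picks them up):

-- same settings as before, now set globally so the UI's session sees them
SET GLOBAL s3_region = 'us-west-2';
SET GLOBAL s3_access_key_id = getenv('S3_ACCESS_KEY_ID');
SET GLOBAL s3_secret_access_key = getenv('S3_SECRET_ACCESS_KEY');
SET GLOBAL s3_url_style = 'path';
SET GLOBAL s3_use_ssl = false;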

Upon connecting to the UI, as expected, the attached Iceberg catalog showed up in the left sidebar under the “Attached Databases” heading. Once I had the credentials and settings figured out, it listed the expected schemas/namespaces from the catalog as well as the intended tables in those schemas. However, I wasn’t able to query the tables in the catalog. The issue actually turned out to be that DuckDB expects each attached database to have a main schema. This was a simple fix: just create an empty namespace within the catalog called main.
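
For completeness, here’s a minimal sketch of that fix. I’m assuming here that your DuckDB/Iceberg/Lakekeeper combination supports creating namespaces through the attached catalog; if not, the same empty main namespace can be created through Lakekeeper’s UI or its REST catalog API instead:

-- create an empty "main" namespace in the attached catalog so DuckDB
-- finds the default schema it expects
CREATE SCHEMA IF NOT EXISTS ice.main;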