Creating a Lakehouse Service: Part 1: Deploying a UI Pod
This is part one of a longer series.
On a personal note, I really enjoyed this part of the project because it helped me solidify some existing knowledge and learn some new things in the process.
Selecting Tools
DuckDB
I knew that I could use open table formats pretty easily with some of the more popular distributed engines such as Snowflake. But, after the Small Data 2025 conference, I was particularly motivated to see how much use I could get out of DuckDB and some of its built-in tooling, such as the new UI. Additionally, I have been really impressed with DuckDB’s integrations. Not only have they been focusing on compatibility with different open table formats, but they’ve been putting their engine in some pretty cool places.
I feel like I’ve tinkered with Postgres a good amount, and, since my main use case (college basketball analytics) tends to be on the more analytical side of the spectrum, I decided to put some hours in on an OLAP engine. So, DuckDB it was.
DuckLake vs Iceberg
While this project would technically be viable with either DuckLake or Iceberg as my open table format of choice (since DuckDB is compatible with both), I decided to use Iceberg since it has more interoperability with other database engines.
I think if I were really trying to go the route of more closely duplicating the efforts of MotherDuck (hosted DuckDB), I would have chosen DuckLake, but I was curious to get my hands dirty with Iceberg and learn some of the finer details.
Lakekeeper
Since I was using Iceberg, I needed a catalog to host and manage the metadata for the tables. I went with Lakekeeper since I had tinkered with it before and already had a Postgres instance running to be its backend. Additionally, the Lakekeeper team offers a Helm chart, which was very useful since I am running this in k8s.
Lakekeeper also offers an easy-to-use (but simple) UI if needed, which I decided to put on a LoadBalancer IP locally just to see what information was surfaced there.
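For reference, exposing that UI is just a LoadBalancer Service in front of the Lakekeeper deployment. Below is a minimal sketch; the namespace and selector labels are assumptions (use whatever your Helm release actually applies), and 8181 is the port Lakekeeper serves on in my setup (you’ll see it again in the catalog endpoint later).
# Sketch only: namespace and selector labels are assumptions -- match them to
# whatever the Lakekeeper Helm release actually creates in your cluster.
apiVersion: v1
kind: Service
metadata:
  name: lakekeeper-ui
  namespace: lakekeeper
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: lakekeeper
  ports:
  - name: http
    port: 8181
    targetPort: 8181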
Homelab
I’ve been meaning to write more extensively about the homelab setup I have going, but consider this blog part of that effort. :)
Long story short, earlier this year I bought four Lenovo ThinkCentre Tinys and clustered them together using k3s since I was interested in learning about containers and Kubernetes more specifically.
This is where I’ll be deploying all of these things and why this blog is k8s-specific! Also, I have a MinIO object store deployed locally in my homelab, so all references here to S3 are actually referring to S3-compatible, self-hosted object storage via MinIO.
Getting to Work
Creating DuckDB Sessions
Connecting to an Iceberg catalog is easy enough with DuckDB. What I set out to do here was create a “script” that we could pass to duckdb -init <script> that ran everything we needed (set the settings, attach the catalog) before launching the UI. Our container setup, then, would be pretty simple. I also wanted to make this as configurable as possible in areas where I identified that some customization/personalization would be useful, but my main guiding principle was that I didn’t want users to have to run anything; I just wanted them to essentially be automatically connected to the lake/lakehouse (the catalog).
I was really impressed by the flexibility of DuckDB’s getenv function here, both for SETting individual session variables and for attaching the catalog.
install httpfs;
load httpfs;
install iceberg;
load iceberg;
install ui;
load ui;
SET UI_LOCAL_PORT = 4213;
SET s3_region = 'us-west-2';
SET s3_access_key_id = getenv('S3_ACCESS_KEY_ID');
SET s3_secret_access_key = getenv('S3_SECRET_ACCESS_KEY');
SET s3_url_style = 'path';
SET s3_use_ssl = false;
ATTACH 'db' AS ice (
    TYPE iceberg,
    ENDPOINT getenv('CATALOG_ENDPOINT'),
    authorization_type none -- should match your Lakekeeper auth settings
);
CALL start_ui_server();
I ended up having to use SET GLOBAL instead of just SET since launching the UI technically starts a session that is distinct from the one that ran the prior commands/settings in the script above. Since deploying this, I have moved the S3 credentials to DuckDB secrets.
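For anyone curious, that ends up looking roughly like the following. This is a sketch rather than the exact secret I use; the parameter names come from DuckDB’s S3 secret support, and if your DuckDB version doesn’t accept getenv() inside CREATE SECRET, just inline the literal values.
-- Sketch: a scoped secret that replaces the s3_* settings above.
-- If getenv() isn't accepted here in your DuckDB version, inline the values.
CREATE SECRET lake_minio (
    TYPE s3,
    KEY_ID getenv('S3_ACCESS_KEY_ID'),
    SECRET getenv('S3_SECRET_ACCESS_KEY'),
    REGION 'us-west-2',
    URL_STYLE 'path',
    USE_SSL false
);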
I was able to deploy locally with duckdb -init script.sql to validate that I
would not need to make any additional settings changes or attach anything once
the UI session began.
Running the UI in a Pod
Now that I had the DuckDB session prepared, I needed to actually get the
“remote” session running. I did this by creating a very simple Dockerfile so I
could build an image to reference within my Pod definition. I’m using a pod
name with a “random” string (abc123) attached to somewhat simulate how I want
the “spinning up a pod” flow to work.
FROM duckdb/duckdb:1.4.2
COPY lake.sql /etc/lake.sql
ENTRYPOINT ["/duckdb", "-init", "/etc/lake.sql"]
apiVersion: v1
kind: Pod
metadata:
  name: duckui-abc123
  labels:
    app: duckui
    session-id: abc123
spec:
  containers:
  - name: duckui
    image: bsale/duckui:latest
    ports:
    - containerPort: 4213
    env:
    - name: S3_ACCESS_KEY_ID
      value: <secret value> # recommend actually using a k8s secret here
    - name: S3_SECRET_ACCESS_KEY
      value: <value> # recommend actually using a k8s secret here
    - name: CATALOG_ENDPOINT
      value: "http://172.x.y.z:8181/catalog"
The advantage of explicitly setting ui_local_port in the DuckDB config is
that the UI reliably comes up on a known port, which makes the Pod config pretty
easy to read.
Lessons learned: tty
At first, my pod deployment was repeatedly failing: it would show as successful for a half second and then crash with a segmentation fault.
I found that this is because DuckDB expects to be run in a terminal, and the
terminal needs to remain open while the UI is open. The pod was closing the
terminal since it had run the command(s) it was supposed to. The solution here is to
use stdin and tty. I don’t love this solution since it requires force
stopping the pod if you need to re-create it, but it does mean
everything works. I’ll be looking for a better solution for this part.
apiVersion: v1
kind: Pod
metadata:
  name: duckui-abc123
  labels:
    app: duckui
    session-id: abc123
spec:
  containers:
  - name: duckui
    image: bsale/duckui:latest
    stdin: true
    tty: true
    ports:
    - containerPort: 4213
    env:
    - name: S3_ACCESS_KEY_ID
      value: <secret value> # recommend actually using a k8s secret here
    - name: S3_SECRET_ACCESS_KEY
      value: <value> # recommend actually using a k8s secret here
    - name: CATALOG_ENDPOINT
      value: "http://172.x.y.z:8181/catalog"
Lessons learned: localhost vs 0.0.0.0
Everything to this point has been pretty smooth sailing, but then we enter networking land. Everyone’s favorite, right?
DuckDB does not accept any configuration for which interface(s) the UI binds to;
it only serves on localhost. There’s a
GitHub issue
about this (hint:
the solution is discussed within that issue). I have used sidecars before, but
this was my first opportunity (read: first time it was necessary) to deploy one
alongside something custom.
Simple enough: we get an nginx container running alongside our custom DuckDB container and set up our proxy.
apiVersion: v1
kind: Pod
metadata:
  name: duckui-abc123
  labels:
    app: duckui
    session-id: abc123
spec:
  containers:
  - name: duckui
    image: bsale/duckui:latest
    ### omitting rest of duckui config for brevity
  - name: ui-proxy
    image: nginx:alpine
    volumeMounts:
    - name: nginx-conf
      mountPath: /etc/nginx/conf.d
    ports:
    - name: ui
      containerPort: 8080
  volumes:
  - name: nginx-conf
    configMap:
      name: duckui-nginx-conf
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: duckui-nginx-conf
  namespace: custom
data:
  default.conf: |-
    server {
      listen 8080;
      location / {
        proxy_pass http://localhost:4213;
      }
    }
The lesson learned here: localhost is only reachable from that host itself; it is not accessible from outside. I think I knew that from a personal-computer standpoint, but I didn’t really think through its implications as they pertain to containers and Kubernetes. So we have to add the proxy to kind of “tunnel” the external connection to what we actually care about.
Lessons learned: Secure hosts
Next, I navigated to my url/IP and found… this error:
auth0-spa-js must run on a secure origin. See https://github.com/auth0/auth0-spa-js/blob/main/FAQ.md#why-do-I-get-auth0-spa-js-must-run-on-a-secure-origin for more information.
Cool, something I recognized from that GitHub issue earlier. New errors are good! After some reading, I learned that localhost is treated as a secure origin by default. But since we were coming from outside localhost, we needed a solution here. ChatGPT recommended that I simply port-forward, but that seemed like a very manual and hacky solution that wouldn’t “scale” to what I wanted from this project.
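For completeness, the port-forward route would have looked something like this, with every user running their own forward and browsing to localhost, which is exactly the kind of manual step I wanted to avoid:
kubectl port-forward -n custom pod/duckui-abc123 4213:4213
# then browse to http://localhost:4213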
I use Traefik for my k8s cluster. I configured an IngressRoute so that I
could leverage HTTPS, which counts as a secure origin. This is a more repeatable,
“scalable” solution, since we could create an IngressRoute straight to each pod
later on in the project.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: duckui-abc123
  namespace: traefik
  annotations:
    kubernetes.io/ingress.class: traefik-external
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`duckui-abc123.local.bsale.me`)
      kind: Rule
      services:
        - name: duckui-abc123
          kind: Service
          namespace: custom
          port: 80
  tls:
    secretName: my-tls-cert
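The IngressRoute points at a duckui-abc123 Service that I haven’t shown here; a minimal sketch of what it needs to do (my assumption: select the pod by its labels and map port 80 to the nginx sidecar’s 8080) would be:
# Sketch of the Service the IngressRoute references:
# port 80 in, nginx sidecar's 8080 out.
apiVersion: v1
kind: Service
metadata:
  name: duckui-abc123
  namespace: custom
spec:
  selector:
    app: duckui
    session-id: abc123
  ports:
  - name: ui
    port: 80
    targetPort: 8080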
Lessons learned: Proxy headers
Simply using HTTPS fixed the error from the last section. Now I was presented with a more obscure error:
Failed to resolve app state with user - RangeError: Offset is outside the bounds of the DataView
When I searched this error, this GitHub issue came up, which kind of reminded me of something I saw on that other GitHub issue from earlier.
To get a better look at what was going on, I compared the headers from requests
hitting my local URL https://duckui-abc123.local.bsale.me
to the headers from a fresh duckdb -ui run locally. I was intrigued to find that
some of the requests to the API were going through successfully, but there
were two that weren’t: a POST to /localToken and a POST to /ddb/run.
These were both returning 401s, which I verified in the pod logs for the
proxy.
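Verifying that is just a matter of pointing kubectl logs at the proxy container (again assuming the custom namespace):
kubectl logs -n custom duckui-abc123 -c ui-proxy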
As recommended in
this comment on the earlier GitHub issue,
I set proxy_set_header to localhost:4213 for each of Host, Origin,
and Referer within the proxy config.
apiVersion: v1
kind: ConfigMap
metadata:
  name: duckui-nginx-conf
  namespace: custom
data:
  default.conf: |-
    server {
      listen 8080;
      location / {
        proxy_pass http://localhost:4213;
        proxy_set_header Host localhost:4213;
        proxy_set_header Origin http://localhost:4213;
        proxy_set_header Referer http://localhost:4213;
      }
    }
aaaaand this worked!
Verifying Catalog Connection
Now that I had a fully functioning DuckDB UI running within the cluster, I decided to see whether my script had correctly attached everything for use.
The first query that I ran was select * from duckdb_settings(); to see which
settings were and weren’t set. I noticed that the S3 settings from my script
were not set. As I alluded to earlier, this was a matter of changing the SET
command to SET GLOBAL. Once I did that, these settings showed up as expected.
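Concretely, the S3 lines in the init script become:
SET GLOBAL s3_region = 'us-west-2';
SET GLOBAL s3_access_key_id = getenv('S3_ACCESS_KEY_ID');
SET GLOBAL s3_secret_access_key = getenv('S3_SECRET_ACCESS_KEY');
SET GLOBAL s3_url_style = 'path';
SET GLOBAL s3_use_ssl = false;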
Upon connecting to the UI, as expected, the attached Iceberg catalog shows up in
the left sidebar under the “Attached Databases” heading. Once I had the
credentials and settings figured out, it showed the expected schemas/namespaces
from the catalog as well as the intended tables in those schemas. However, I
wasn’t able to query the tables in the catalog. The issue turned out to
be that DuckDB expects each attached database to have a main schema. This
was a simple fix: just create an empty namespace in the catalog called
main.
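Any catalog client can create that namespace (the Lakekeeper UI is the zero-code option), but if your version of the iceberg extension supports DDL against the REST catalog, a sketch from within the attached DuckDB session would be:
-- Sketch: create the empty 'main' namespace in the attached catalog.
-- Requires an iceberg extension version with write/DDL support; otherwise
-- create the namespace via the Lakekeeper UI or API instead.
CREATE SCHEMA ice.main;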