Creating a Lakehouse Service: Background
I haven’t written nearly as much as I’d have liked to. However, I’m starting a new job next week, took the holiday week off in between jobs, and decided it was a good time to hack around a bit. This will be part one (zero?) of a multi-part series (as I actually start to piece these things together).
Lakehouse?
Very early in my data scientist-to-engineer transition, the lakehouse became a very appealing architecture pattern to me, and it was starting to receive a lot of attention. At work (previous employer), we were migrating analytics workloads off of a SQL Server and onto Snowflake. For a hackathon (three-ish years ago!), my team and I decided to take a look at the dbt-external-tables package for some ingestion tasks. At the time, the package was really focused on Snowpipe, so that was the focus of our project as well.
At this previous employer, we didn’t really have the “real time” needs that would justify using Snowpipe for all ingestion tasks. However, the experience showed me some of the cool potential of the lakehouse architecture. Of all the benefits, the most appealing to me was “you can tear this table down, change the transformations (if you want/need to), and fully rebuild the table,” since the source data is readily available.
As I spent more time learning about open table formats, they continued gaining popularity. I poked around with Iceberg, but didn’t really have data at a scale that would justify using an open table format. Still, I always wanted to learn more. Earlier this year, I attended Iceberg Summit 2025 and met some users who were really excited about its potential. At that summit, I also met Christian of Lakekeeper, an open-source Iceberg catalog. IMO, the catalogs are one of the clunkier bits of using these open table formats, and it was great to see someone trying to make an easy-to-use solution.
Overall, after playing around with some of the tooling (more on this in a second), I decided to commit to seeing what adopting a lakehouse pattern would look like for ballmart and in my homelab.
Enter DuckDB
I was in love the second I started using DuckDB. I’ve always tried to find creative uses for it since it is such a powerful yet simple engine. When I first saw they were experimenting with open table formats, I got excited.
DuckLake
Earlier this year, DuckDB released DuckLake, a DuckDB-native open table format. As I mentioned earlier, one of the clunky parts of open table formats such as Iceberg is the “catalog”, and I enjoyed that the DuckLake introduction blog post called this out as one of the ironies of open table formats: we’re creating this “filesystem as database” sort of idea (“replacing” the database with files and some metadata), but you need a database to host that metadata and point to the files. DuckLake makes this simpler; instead of needing a catalog service (which in turn uses a database behind the scenes), DuckLake just has you connect directly to a database for the metadata storage.
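To make that concrete, here’s roughly what the simplest setup looks like from Python (a minimal sketch; the file names and paths are placeholders I made up):

```python
import duckdb

con = duckdb.connect()

# DuckLake is just a DuckDB extension; the "catalog" is a plain database.
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Simplest case: metadata lives in a local DuckDB file, data files live in a folder.
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")

# From here, tables in the lake behave like regular DuckDB tables.
con.execute("CREATE TABLE IF NOT EXISTS lake.events AS SELECT 42 AS answer")
print(con.sql("SELECT * FROM lake.events").fetchall())
```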
Since DuckDB is an in-process database, I have always wondered how I could get it into the hands of more non-technical users. Obviously, it has countless uses as “just” that in-process engine, but I had always wondered whether you could “serve” DuckDB. DuckLake, in a sense, and especially for read-heavy workloads, offers the ability to “serve” DuckDB datasets. While the engine itself runs in one node, the data that the node is accessing can live in some shared location. With a bit of configuration, individual DuckDB nodes (MotherDuck refers to them as “ducklings”), whether they are running in the cloud or on someone’s local machine, can all point to the same data tables, much like a database server.
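As a rough sketch of that “serving” idea, here’s what a shared setup could look like, assuming a Postgres database for the metadata and an S3 bucket for the data files (the connection details are made up, and I’m assuming S3 credentials are already configured as a DuckDB secret):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# Any "duckling" -- a laptop, a pod, a CI job -- runs this same ATTACH.
# Metadata is shared in Postgres, data files are shared in object storage,
# so every node that attaches this way sees the same tables.
con.execute("""
    ATTACH 'ducklake:postgres:dbname=ducklake host=db.internal user=duck'
    AS lake (DATA_PATH 's3://my-lakehouse/ducklake/')
""")

# List what this node can see in the shared lake.
print(con.sql(
    "SELECT table_name FROM duckdb_tables() WHERE database_name = 'lake'"
).fetchall())
```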
DuckDB UI
Something that I learned in my previous job is that people are generally deeply distrustful of the command line. There’s just a general preference for a user interface, especially for more non-technical users. I think there is something to be said for “meeting your users where they’re at”, and I can understand being uncomfortable at a terminal prompt if that is not something you have a lot of experience with.
I’m really happy that DuckDB launched a UI to address users like these. The UI is simple, but it contains pretty much all of the features a user truly needs. I particularly like the Snowflake-esque schema summary and summary statistics you get when you click on a table that is attached in the DuckDB session. Most users’ first exposure to databases and SQL is a worksheet/query-style interface, and DuckDB went with more of a “notebook”-style interface, which won’t feel unfamiliar to users of the worksheet/query style.
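For reference, the UI ships as a DuckDB extension. The documented way to launch it is `duckdb -ui` from a shell; I believe the same extension can also be loaded from the Python client, but treat this sketch as an assumption rather than gospel:

```python
import duckdb

con = duckdb.connect("warehouse.duckdb")

# The ui extension serves the DuckDB UI locally (default http://localhost:4213)
# and opens it in the browser.
con.execute("INSTALL ui")
con.execute("LOAD ui")
con.execute("CALL start_ui()")
```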
Project
The idea for my project is as follows: I want to see how viable a lakehouse-only architecture is for an organization. I’ll use my homelab and ballmart to simulate having different data sources, users, and workloads.
My end goals:
- User issues a command (command line? Some UI?) to either:
  - spin up the local UI (user would use their own computer), or
  - spin up a “remote” UI:
    - Command triggers creation of a Kubernetes pod
    - Pod is running an instance of the DuckDB UI
    - Pod auto-connects to a Lakekeeper catalog and can access Iceberg tables in the catalog
    - Pod UI is accessible via browser
- Something tears down the pod when done (TTL? some command within the pod?)
Part One: Deploying a UI Pod
The first part of this project was getting a single Kubernetes pod running the UI for a user to use. I picked this part first since it is the core of the other parts of the project. Since the image just fires up the DuckDB UI and runs a simple init script, this configuration should carry over very nicely to the “local” deployment method, too.
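To give a flavor of what I mean by “init script”, here’s a hypothetical version of what the image could run on startup. The Lakekeeper endpoint, environment variable names, and the iceberg-extension attach options are placeholders from memory, so double-check them against the duckdb-iceberg and Lakekeeper docs:

```python
# init.py -- hypothetical startup script baked into the UI image
import os
import time

import duckdb

con = duckdb.connect("/home/duck/session.duckdb")

# Hypothetical configuration injected via the pod spec.
client_id = os.environ["LAKEKEEPER_CLIENT_ID"]
client_secret = os.environ["LAKEKEEPER_CLIENT_SECRET"]
token_url = os.environ["OIDC_TOKEN_URL"]
catalog_url = os.environ.get("LAKEKEEPER_URL", "http://lakekeeper:8181/catalog")
warehouse = os.environ.get("LAKEKEEPER_WAREHOUSE", "warehouse")

# Attach the Lakekeeper REST catalog so Iceberg tables show up in the UI.
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
con.execute(f"""
    CREATE SECRET (
        TYPE ICEBERG,
        CLIENT_ID '{client_id}',
        CLIENT_SECRET '{client_secret}',
        OAUTH2_SERVER_URI '{token_url}'
    )
""")
con.execute(f"ATTACH '{warehouse}' AS lakehouse (TYPE ICEBERG, ENDPOINT '{catalog_url}')")

# Serve the UI without trying to open a browser (we're in a container),
# then keep the process alive so the server stays up.
con.execute("INSTALL ui")
con.execute("LOAD ui")
con.execute("CALL start_ui_server()")
while True:
    time.sleep(60)
```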
Part Two: Creating the API
The second part of this project was building the API that creates the Kubernetes resources needed to run the “remote” version of the DuckDB UI pod. I enjoyed this section since I was really impressed with the Python Kubernetes client.
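As a sketch of the kind of call I’m talking about, here’s roughly how the API could create a pod with the official kubernetes Python client (the image name, namespace, labels, and port are placeholders, not the project’s actual values):

```python
import uuid

from kubernetes import client, config


def launch_duckdb_ui_pod(namespace: str = "duckdb-ui") -> str:
    """Create a pod running the DuckDB UI image and return its session id."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster

    session_id = uuid.uuid4().hex[:8]
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"duckdb-ui-{session_id}",
            labels={"app": "duckdb-ui", "session": session_id},
        ),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="duckdb-ui",
                    image="registry.local/duckdb-ui:latest",  # placeholder image
                    ports=[client.V1ContainerPort(container_port=4213)],
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)
    return session_id
```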
Part Three: Creating a CLI
Users need an easy way to access the API from the previous step. This is the actual “user issues a command…” portion of this project.
Ideally, this CLI has some form of “sessions” so that users don’t have to manage which session IDs they have open. Perhaps something like a ~/.config/duckui.json file could keep that state information.
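To make that concrete, here’s a hypothetical sketch of such a CLI; the duckui name, the API routes, and the JSON layout are all made up:

```python
# duckui.py -- hypothetical CLI sketch; API routes and config layout are made up
import argparse
import json
from pathlib import Path

import requests

CONFIG_PATH = Path.home() / ".config" / "duckui.json"
API_URL = "http://duckui-api.internal"  # placeholder for the API from part two


def load_state() -> dict:
    return json.loads(CONFIG_PATH.read_text()) if CONFIG_PATH.exists() else {"sessions": {}}


def save_state(state: dict) -> None:
    CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
    CONFIG_PATH.write_text(json.dumps(state, indent=2))


def start(args) -> None:
    # Ask the API to spin up a UI pod and remember the session it hands back.
    resp = requests.post(f"{API_URL}/sessions")
    resp.raise_for_status()
    session = resp.json()
    state = load_state()
    state["sessions"][session["id"]] = session["url"]
    save_state(state)
    print(f"Session {session['id']} ready at {session['url']}")


def stop(args) -> None:
    # Tear down the pod and forget the session locally.
    requests.delete(f"{API_URL}/sessions/{args.session_id}").raise_for_status()
    state = load_state()
    state["sessions"].pop(args.session_id, None)
    save_state(state)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="duckui")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("start").set_defaults(func=start)
    stop_parser = sub.add_parser("stop")
    stop_parser.add_argument("session_id")
    stop_parser.set_defaults(func=stop)
    args = parser.parse_args()
    args.func(args)
```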