Running llama-swap and llama.cpp on Strix Halo

Since my last post, I've been mostly content running llama.cpp directly on my Strix Halo laptop, but I've recently switched to running llama-swap.

While the Strix Halo hardware technically supports ROCm, performance has been very disappointing and I've run into far too many stability issues with it, so I've mostly been running things using the Vulkan backend.

Llama-swap is a handy piece of software that exposes an OpenAI-compatible API server, loads LLM models on demand, and provides a convenient way to manage which models are loaded and unloaded.
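For reference, a llama-swap config maps model names to the llama-server command used to launch them. The snippet below is only a rough sketch based on the upstream examples; the model name, file path, and flags are placeholders, and the exact fields can differ between llama-swap versions:

  models:
    "qwen3-30b":
      cmd: |
        /app/llama-server
        --port ${PORT}
        -m /path/to/qwen3-30b.gguf
        -ngl 99
      ttl: 300  # optionally unload the model after 5 minutes of inactivity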

At first, I just ran their Vulkan docker image as described in their README:

  docker run -it --rm --device /dev/dri -p 9292:8080 \
    -v ~/.cache/llama.cpp:/home/ubuntu/.cache/llama.cpp \
    -v ./llama-swap.yaml:/app/config.yaml \
    ghcr.io/mostlygeek/llama-swap:vulkan
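Once this is running, anything that speaks the OpenAI API can be pointed at port 9292, and llama-swap will start the matching llama-server instance on the first request. For example, assuming a model named qwen3-30b is defined in your llama-swap.yaml:

  curl http://localhost:9292/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3-30b", "messages": [{"role": "user", "content": "Hello!"}]}'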

However, this uses Mesa's RADV Vulkan driver. RADV is an excellent driver, and AMD has decided to directly support its further development.

That said, at the time of writing, you can see roughly 50% higher token generation speeds for some models when using the older (and now seemingly abandoned) AMDVLK driver, which was AMD's in-house open-source Linux driver.
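If you have both drivers installed and want to check which one the Vulkan loader picks, vulkaninfo (from the vulkan-tools package) will tell you. The ICD manifest path below is an assumption and varies between distributions:

  # show the driver the Vulkan loader selects by default
  vulkaninfo --summary | grep -i driverName

  # force a specific ICD, e.g. AMDVLK (older loaders use VK_ICD_FILENAMES instead)
  VK_DRIVER_FILES=/usr/share/vulkan/icd.d/amd_icd64.json vulkaninfo --summary | grep -i driverName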

Sadly, there wasn't an easy option for running llama-swap with this driver, so I made my own image.

The sources for this image can be found at github.com/wvdschel/llama-swap-amdvlk, along with a copy of my llama-swap config to help you get started, should you want to do the same thing.

To run the image, use the following command:

  docker run --rm --name llama-swap -p 9292:8080 --device /dev/dri \
    --volume ./llama-swap.yaml:/app/config.yaml \
    --volume ~/.cache/llama.cpp:/cache/llama.cpp \
    quay.io/wvdschel/llama-swap-amdvlk:latest

Update the path of your llama-swap.yaml or config.yaml file as needed.
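To check that the server came up, the OpenAI-compatible model listing should return the entries from your config:

  curl http://localhost:9292/v1/models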

The repository also includes a podman quadlet file that you can install if you want llama-swap to run whenever you log in to your desktop.
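For reference, a user-level quadlet for this image looks roughly like the sketch below (not the copy from the repository; the paths are assumptions you'll want to adapt), placed in ~/.config/containers/systemd/llama-swap.container:

  [Unit]
  Description=llama-swap with AMDVLK

  [Container]
  Image=quay.io/wvdschel/llama-swap-amdvlk:latest
  PublishPort=9292:8080
  AddDevice=/dev/dri
  Volume=%h/llama-swap.yaml:/app/config.yaml
  Volume=%h/.cache/llama.cpp:/cache/llama.cpp

  [Install]
  WantedBy=default.target

After a systemctl --user daemon-reload, the generated service can be started with systemctl --user start llama-swap.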