Running LLMs locally using any GPU on Fedora

This post documents what I needed to do to run LLMs downloaded from Hugging Face on Linux using any Vulkan-capable GPU.

I've tested this on Intel integrated graphics, the Apple M2 Pro GPU, an Nvidia GPU, and an AMD GPU. I'll do a follow-up post comparing this acceleration against ROCm (AMD), CUDA (Nvidia), and CPU-only (Apple).

Build dependencies

You'll need a copy of the Vulkan SDK, the GLSL shader compiler, Python, a C++ compiler, and CMake. Additionally, for downloading large language models from Hugging Face, you'll need Git with LFS (Large File Storage) support.

On Fedora, the following installs everything you need:

sudo dnf install vulkan-devel glslc glslang python-devel @development-tools cmake git-lfs
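
Before building, it's worth confirming that your GPU is actually visible to Vulkan. The vulkaninfo tool (from the vulkan-tools package, which may not be pulled in by the packages above) prints a short device summary:

sudo dnf install vulkan-tools
vulkaninfo --summary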

Build llama.cpp

Check out the llama.cpp source code and build it with Vulkan support enabled:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release

If you run into the following error, as I did on Asahi Linux:

cc1: error: unknown value ‘native+nodotprod+noi8mm+nosve’ for ‘-mcpu’
cc1: note: valid arguments are: cortex-a34 cortex-a35 cortex-a53 cortex-a57 cortex-a72 cortex-a73 thunderx thunderxt88p1 thunderxt88 octeontx octeontx81 octeontx83 thunderxt81 thunderxt83 ampere1 ampere1a ampere1b emag xgene1 falkor qdf24xx exynos-m1 phecda thunderx2t99p1 vulcan thunderx2t99 cortex-a55 cortex-a75 cortex-a76 cortex-a76ae cortex-a77 cortex-a78 cortex-a78ae cortex-a78c cortex-a65 cortex-a65ae cortex-x1 cortex-x1c neoverse-n1 ares neoverse-e1 octeontx2 octeontx2t98 octeontx2t96 octeontx2t93 octeontx2f95 octeontx2f95n octeontx2f95mm a64fx fujitsu-monaka tsv110 thunderx3t110 neoverse-v1 zeus neoverse-512tvb saphira cortex-a57.cortex-a53 cortex-a72.cortex-a53 cortex-a73.cortex-a35 cortex-a73.cortex-a53 cortex-a75.cortex-a55 cortex-a76.cortex-a55 cortex-r82 cortex-a510 cortex-a520 cortex-a710 cortex-a715 cortex-a720 cortex-a725 cortex-x2 cortex-x3 cortex-x4 cortex-x925 neoverse-n2 cobalt-100 neoverse-n3 neoverse-v2 grace neoverse-v3 neoverse-v3ae demeter generic generic-armv8-a generic-armv9-a
CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:151 (message):
  Failed to get ARM features
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:318 (ggml_add_cpu_backend_variant_impl)


-- Adding CPU backend variant ggml-cpu: -mcpu=native+nodotprod+noi8mm+nosve 
CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find Vulkan (missing: glslc) (found version "1.4.304")
Call Stack (most recent call first):
  /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake/Modules/FindVulkan.cmake:595 (find_package_handle_standard_args)
  ggml/src/ggml-vulkan/CMakeLists.txt:4 (find_package)

Try again with LLVM instead of GCC:

rm -rf build
export CC=clang CXX=clang++
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
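
If you'd rather not rely on exported environment variables, the same compilers can be passed directly to CMake; this is equivalent to the export above:

rm -rf build
cmake -B build -DGGML_VULKAN=1 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
cmake --build build --config Release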

Preparing a model

Prepare a virtualenv for running the conversion scripts:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you run into the following error:

ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9
ERROR: Could not find a version that satisfies the requirement torch~=2.2.1 (from versions: 2.6.0, 2.6.0+cpu)

[notice] A new release of pip is available: 24.2 -> 25.0
[notice] To update, run: pip install --upgrade pip
ERROR: No matching distribution found for torch~=2.2.1

Replace the required torch version and try again:

git grep --name-only torch~= | xargs sed -i -e 's/torch~=2\.2\.1/torch~=2.6.0/'
pip install -r requirements.txt

Download a model of choice:

git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
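
As an alternative, if you'd rather not pull the full git history, the huggingface_hub package ships a huggingface-cli tool that can download just the model files. A sketch, run from the same virtualenv (huggingface_hub may already be present as a dependency of the conversion requirements):

pip install huggingface_hub
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --local-dir ./DeepSeek-R1-Distill-Qwen-14B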

Run the conversion script:

./convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Qwen-14B/

By default, this script converts the tensors into a GGUF file that llama.cpp can run, using the best 16-bit float format your platform supports. You can also quantize to a smaller value size so that bigger models fit in memory. To generate a model with 8-bit values (which is what Ollama uses), run:

./convert_hf_to_gguf.py --outtype q8_0 ./DeepSeek-R1-Distill-Qwen-14B/
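
You can also produce smaller quantizations after the fact with the llama-quantize tool built alongside llama-cli. As a sketch, assuming the default conversion above wrote an F16 file named DeepSeek-R1-Distill-Qwen-14B-F16.gguf (adjust the filename to whatever the script actually printed), a 4-bit Q4_K_M variant trades some quality for a much smaller memory footprint:

./build/bin/llama-quantize ./DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-F16.gguf ./DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf Q4_K_M

Running llama-quantize without arguments prints the full list of supported quantization types.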

Run the model

Depending on the amount of video memory you have available, you can offload more or fewer layers of the model to your graphics processor using the -ngl flag.

A bit of trial and error will help you figure out what's most appropriate for your hardware.

./build/bin/llama-cli -m ./DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf -p "You are a helpful assistant" -cnv -ngl 99
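
Instead of the interactive CLI, you can also serve the same model over HTTP with llama-server, which exposes an OpenAI-compatible API and a small built-in web UI (the port below is arbitrary):

./build/bin/llama-server -m ./DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf -ngl 99 --port 8080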

You can also run llama.cpp in benchmark mode:

./build/bin/llama-bench -ngl 99 -m ./DeepSeek-R1-Distill-Qwen-14B/DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX A5000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 14B Q8_0                 |  14.62 GiB |    14.77 B | Vulkan     |  99 |         pp512 |        698.36 ± 8.66 |
| qwen2 14B Q8_0                 |  14.62 GiB |    14.77 B | Vulkan     |  99 |         tg128 |         20.47 ± 0.11 |

build: 1782cdfe (4798)