[App] ToshLLM — local LLMs on Intel + AMD GPU (Metal, AMD‑patched llama.cpp, open source)

engeldlgado · 2026-06-14T02:13:01Z

Hi all, im sharing a project that might be useful to anyone running an AMD GPU on a Hackintosh.

The problem: local‑LLM tooling on macOS targets Apple Silicon. On Intel Macs with discrete AMD GPUs, stock llama.cpp under Metal produces corrupted output and is painfully slow over PCIe.

ToshLLM is a native SwiftUI app (pure Swift Package Manager, no external deps) that bundles llama.cpp built with AMD‑specific patches and wraps it in a real GUI:

Correct Metal output on AMD dGPUs at full speed
Qwen3‑8B Q4: ~101 t/s prompt / ~57 t/s generation
Qwen3.6‑35B‑A3B (MoE, hybrid offload): ~123 t/s / ~18.6 t/s, up to ~25.7 t/s with MTP
Native chat (Markdown, code copy, file attachments), model manager with per‑model VRAM/RAM estimates, automatic MoE CPU‑offload, MTP speculative decoding, dual engines (official + TurboQuant for 100k+ ctx), built‑in benchmarks, OpenAI‑compatible API, bilingual EN/ES
New macOS 26 “Tahoe” Liquid Glass interface (degrades to translucent materials on macOS 14/15)
Hardware: developed on RX 6700 XT 12 GB + [NootRX](https://github.com/ChefKissInc/NootRX); runs on any working Metal setup

Its beta. DMGs aren’t notarized yet (first launch needs “Open Anyway” or `xattr -dr com.apple.quarantine`). The AMD patches live in the repo (`patches/`), so you can build from source too.

License: GPL‑3.0. Repo, source and DMG releases:

Link to Github Project

Would love testing reports from other AMD cards (6600/6800/6900, RDNA3, Polaris/Vega).

mitch_de · 2026-06-14T07:39:19Z

Hi,

worked on my Hackintosh with RX 5600 XT - I5 12400F 4,8 GHZ OC DDR5 ( and Macbook Pro RX560x - but very slow)

The smallest LLM gave 52.1 / 100.0 in the benchmark . close to your 6700XT?

Mobile RX 560X on MacbookPro

engeldlgado · 2026-06-14T10:27:45Z

Nice, thanks for testing! That RX 5600 XT result is genuinely great. Just a heads-up for the comparison: you ran Qwen3-4B, while my 101/57 numbers are for the bigger Qwen3-8B — so not quite the same test. Your prompt speed basically matches my 6700 XT; generation is a bit lower because it's bandwidth-bound and the 5600 XT has less memory bandwidth (no Infinity Cache). It'll fit the 8B fine too (4.7 GB) if you want a direct apples-to-apples run — and your DDR5 + 12400F will really shine on the bigger MoE models.

The MacBook's RX 560X being slow is expected — that's an old Polaris chip with very little VRAM and bandwidth, so generation falls off a cliff (the model can't really stay resident on the GPU). Prompt still looks OK because it's batched, but ~1 t/s gen is just the card showing its age. The 5600 XT is the one to use.

For that 4B model i got 68 t/s | 146 t/s on the RX 6700 XT

If you grab the 8B and share the numbers, I'd love to add the RX 5600 XT as a tested card. Appreciate the report!

Edited 3 hours ago by engeldlgado

[App] ToshLLM — local LLMs on Intel + AMD GPU (Metal, AMD‑patched llama.cpp, open source)

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation