[App] ToshLLM — local LLMs on Intel + AMD GPU (Metal, AMD‑patched llama.cpp, open source)

engeldlgado · Sunday at 02:13 AM

Hi all, im sharing a project that might be useful to anyone running an AMD GPU on a Hackintosh.

The problem: local‑LLM tooling on macOS targets Apple Silicon. On Intel Macs with discrete AMD GPUs, stock llama.cpp under Metal produces corrupted output and is painfully slow over PCIe.

ToshLLM is a native SwiftUI app (pure Swift Package Manager, no external deps) that bundles llama.cpp built with AMD‑specific patches and wraps it in a real GUI:

Correct Metal output on AMD dGPUs at full speed
Qwen3‑8B Q4: ~101 t/s prompt / ~57 t/s generation
Qwen3.6‑35B‑A3B (MoE, hybrid offload): ~123 t/s / ~18.6 t/s, up to ~25.7 t/s with MTP
Native chat (Markdown, code copy, file attachments), model manager with per‑model VRAM/RAM estimates, automatic MoE CPU‑offload, MTP speculative decoding, dual engines (official + TurboQuant for 100k+ ctx), built‑in benchmarks, OpenAI‑compatible API, bilingual EN/ES
New macOS 26 “Tahoe” Liquid Glass interface (degrades to translucent materials on macOS 14/15)
Hardware: developed on RX 6700 XT 12 GB + [NootRX](https://github.com/ChefKissInc/NootRX); runs on any working Metal setup

Its beta. DMGs aren’t notarized yet (first launch needs “Open Anyway” or `xattr -dr com.apple.quarantine`). The AMD patches live in the repo (`patches/`), so you can build from source too.

License: GPL‑3.0. Repo, source and DMG releases:

Link to Github Project

Would love testing reports from other macOS-supported AMD cards: RDNA 1 (RX 5500/5600/5700), RDNA 2 (RX 6600/6700/6800/6900), and older Polaris/Vega.

Edited Sunday at 04:53 PM by engeldlgado

mitch_de · Sunday at 07:39 AM

Hi,

worked on my Hackintosh with RX 5600 XT - I5 12400F 4,8 GHZ OC DDR5 ( and Macbook Pro RX560x - but very slow)

The smallest LLM gave 52.1 / 100.0 in the benchmark . close to your 6700XT?

Mobile RX 560X on MacbookPro

engeldlgado · Sunday at 10:27 AM

Nice, thanks for testing! That RX 5600 XT result is genuinely great. Just a heads-up for the comparison: you ran Qwen3-4B, while my 101/57 numbers are for the bigger Qwen3-8B — so not quite the same test. Your prompt speed basically matches my 6700 XT; generation is a bit lower because it's bandwidth-bound and the 5600 XT has less memory bandwidth (no Infinity Cache). It'll fit the 8B fine too (4.7 GB) if you want a direct apples-to-apples run — and your DDR5 + 12400F will really shine on the bigger MoE models.

The MacBook's RX 560X being slow is expected — that's an old Polaris chip with very little VRAM and bandwidth, so generation falls off a cliff (the model can't really stay resident on the GPU). Prompt still looks OK because it's batched, but ~1 t/s gen is just the card showing its age. The 5600 XT is the one to use.

For that 4B model i got 68 t/s | 146 t/s on the RX 6700 XT

If you grab the 8B and share the numbers, I'd love to add the RX 5600 XT as a tested card. Appreciate the report!

Edited Sunday at 10:29 AM by engeldlgado

JahStories · Sunday at 02:22 PM

Screenshot 2026-06-14 alle 16.21.35.png

Here my bench

P.s.

It's a real MacPro not an Hackintosh

JahStories · Sunday at 03:23 PM

Update, with the suggested model, qwen 3.6 35b a3b Window server crashed and I had to reboot my machine. If I try to do a benchmark with it, the system starts hanging for a few seconds, starts responding and hangs again, I can see the gpu being spiked but seems like it's not able to complete it and I have to stop it.

Regarding feedback, the webpage console is in Spanish even selecting English as main language and my 6900xt is recognized as a 6700.

If you need help translating the application to Italian, I can help.

engeldlgado · Sunday at 03:29 PM

1 hour ago, JahStories said:

Here my bench

P.s.

It's a real MacPro not an Hackintosh

That's an awesome result! Seeing it run so well on a real Mac with that kind of performance makes me wish I had 96GB of RAM and an RX 6900 XT to really push these and other models to their limits. Thanks so much for sharing your results.

Ill check the hardware detection logic, maybe its a UI Bug... Where exactly you see the detected GPU was a 6700 not a 6900?

Regarding multi-language support, I'm going to focus on making the app compatible with more languages in future updates to keep improving it. I'm excited to keep making the app better every day!

Edited Sunday at 03:44 PM by engeldlgado
more details

mitch_de · Sunday at 04:14 PM

On 6/14/2026 at 12:27 PM, engeldlgado said:

For that 4B model i got 68 t/s | 146 t/s on the RX 6700 XT

If you grab the 8B and share the numbers, I'd love to add the RX 5600 XT as a tested card. Appreciate the report!

I will Upload that 8B RX 5600XT bench result in a few hours.

Modell 4B was 52.1. / 100.

ADDED Modell 8B result : 36 / 56

JahStories · Sunday at 06:16 PM

2 hours ago, engeldlgado said:

That's an awesome result! Seeing it run so well on a real Mac with that kind of performance makes me wish I had 96GB of RAM and an RX 6900 XT to really push these and other models to their limits. Thanks so much for sharing your results.

Ill check the hardware detection logic, maybe its a UI Bug... Where exactly you see the detected GPU was a 6700 not a 6900?

Regarding multi-language support, I'm going to focus on making the app compatible with more languages in future updates to keep improving it. I'm excited to keep making the app better every day!

You already did an awesome job.

Regarding the detection, I can see the gpu reported as a 6700 in the web console page.

I'll later try the lighter 30b model later, and will tell you if that one hangs too like it did with the 35b one.

snooksy · Sunday at 11:45 PM

Hi,

On an Intel Mac Pro 7,1 will this support multiple GPUs to improve performance? The Mac Pro 7,1 can have up to 4x Radeon Pro w6800x GPUs.

Thanks

JahStories · 2026-06-15T22:57:39Z

I succesfully completed a bench with qwen 3 35b

0.6 Gen xD

JahStories · 2026-06-15T23:16:28Z

This is GPU only

I think there could be an issue with models that uses the CPU

Launhand · 2026-06-15T23:26:42Z

Interesting project — especially the part about fixing the AMD dGPU path in Metal, that’s usually where things get messy in Hackintosh setups.

The performance numbers you’re seeing are actually pretty impressive for local inference on that kind of hardware, particularly the MoE model behavior with CPU offload. That hybrid approach tends to make a big difference once you push beyond small 7–8B models.

Also good call on bundling the patches directly in the repo instead of hiding them behind a binary — that’s usually what makes or breaks adoption in the Hackintosh / experimental macOS space.
On a broader note, when projects like this move from “interesting demo” to actual community adoption, a lot of the challenge becomes less about raw performance and more about coordination, testing, and user feedback loops. Some teams even end up using event-driven tooling around releases, testing, or community coordination — for example setups like integrate eventbrite can sometimes be used in broader ecosystems where events, updates, and user engagement need to be tracked or automated.

Out of curiosity: how stable is it under longer sessions (say multi-hour chats or large context windows)? That’s usually where subtle GPU/Metal issues tend to show up even if benchmarks look solid.

Edited 6 hours ago by Launhand

fspkwonx86 · 2026-06-16T00:07:53Z

you can take it from me thats the craziest thing i ever saw is a local encyclopedia running on my computer

engeldlgado · 2026-06-16T02:25:49Z

Im working hard with a custom kernel for AMD for ollama that fix the Flash Attention issue for AMD Gpu thats require Silicon Mac, ill give news soon. 🫠

mitch_de · 2026-06-16T08:02:53Z

Hi, now I also tested the 8B Modell on my RX 5600XT

Bench values near same between latest .15 Version und the older .13 Version.

engeldlgado · 2026-06-16T13:58:16Z

Progress update, since a few of you asked about stability and I'd mentioned a kernel I was working on.

The big one: the AMD Flash Attention kernel is done and shipped. The reason it matters: on these AMD GPUs Metal refuses to run Flash Attention because it's gated on a hardware feature (simdgroup matrix multiply, the Apple7 family check) that the cards report as unavailable, and any quantized or compressed KV cache *requires* FA, so attention silently falls back to the CPU and generation collapses as context grows. I wrote a from-scratch Metal kernel that keeps both prompt processing and generation on the GPU instead. This week I optimized it further by splitting the KV stream across more simdgroups, and the decode-at-depth numbers jumped: on an 8B with a compressed cache, generation went +42% at 2k context and +75% at 4k (roughly 19 to 33 t/s), with output unchanged, validated on two different head dimensions. It's in the latest release as an opt-in toggle on the experimental engine.

On long sessions / large context (someone asked, and it's a fair question): that was actually broken in a subtle way and is now fixed. On a single GPU shared between the UI and inference, the chat was re-laying-out the whole Markdown transcript on every token, which starved the inference and froze generation for seconds at a time in long chats. The UI now only re-renders the part that's actually changing, so generation stays smooth in long multi-hour conversations, and combined with the kernel above, decode holds up at depth instead of falling off, but it needs further testing.

JahStories — that 35B MoE hang is real and I want to be straight about it. It's not the dense GPU path, it's the mixture-of-experts CPU-offload path. When experts get offloaded to system RAM and streamed back per token, the AMD GPU's command processor deadlocks under Metal and takes WindowServer down with it. You called it exactly with "I think there could be an issue with models that use the CPU" — that's precisely the case. It sits below the app, in the driver/Metal layer, so it isn't something I can patch from my side; the dense models that stay fully resident on the GPU are solid. The 0.6 t/s you saw is the offload thrashing against that, not the engine itself. I'm looking into whether there's an offload pattern that sidesteps the hang.

On the 6900 XT showing up as a 6700 and the web console stuck in Spanish: the console localization landed a couple of versions back, so updating should fix the language. The GPU label I'll track down — thanks for pointing specifically at the web console, that narrows where to look. And I'll take you up on the Italian translation offer, much appreciated.

On the Mac Pro 7,1 with multiple W6800X cards: multi-GPU isn't wired up yet. llama.cpp can split a model across devices, but the Metal-on-AMD path makes that non-trivial and I haven't validated it, so for now it's single-GPU. It's on the list.

Thanks for all the testing, genuinely. The RX 5600 XT and the real Mac Pro reports are exactly the kind of coverage I can't get on my own hardware, and it's already shaping what I work on next.

JahStories · 2026-06-16T14:53:07Z

thanks a lot for the feedback, hit me up with the og english text to translate to italian and I’ll try to do it asap!

[App] ToshLLM — local LLMs on Intel + AMD GPU (Metal, AMD‑patched llama.cpp, open source)

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation