44 posts in this topic


5 minutes ago, engeldlgado said:

Thanks for the feedback.

Sure thing man.

 

6 minutes ago, engeldlgado said:

Was the AI in the middle of generating a response when this error occurred?

No, I just wanted to test how the app handles file attachments. I attached several files and the error occurred, but it was able to analyze a single, fairly short text file without any errors. The files I attached were pretty large, so I guess that's what caused the error.

 

8 minutes ago, engeldlgado said:

I will note it down, but keep in mind that hardware combination might simply hit its limits when benchmarking a 4B model like Qwen3.

Thanks. Yeah, I didn't expect much from that rig, but since you asked for a benchmark on Polaris/Vega GPUs, I thought I'd share my experience.

10 minutes ago, engeldlgado said:

Also, have you tested the experimental engine? It may work better because it has a custom kernel for AMD.

I will give it a try later and keep you posted.

15 hours ago, engeldlgado said:

Try the smallest one first as a test. By the way, what kind of system specs do you have?

Qwen 4B

image.thumb.png.6127489d1271f978bf97ae77d3df5524.png

I have both systems in my signature. Both are based on Coffee Lake CPUs, with an RX 560 and an RX 580.

 

Posted (edited)
17 minutes ago, XanthraX said:

I have both systems in my signature. Both are based on Coffee Lake CPUs, with an RX 560 and an RX 580.

 

 

Sorry, I didn't notice because I was on my phone when I replied to you.

 

Your RX-580 is GCN/Polaris, not RDNA+...

My AMD decode kernel is only instantiated for RDNA+ (RX-5000/6000 series). It may work on others, but that needs further testing, so ToshLLM won't work on your RX-580 and RX-560 at the moment.

 

But I'm going to study integrating a GCN/Polaris-compatible patch. I'll need to rewrite the kernel to use 64-lane SIMD groups instead of RDNA's 32-lane simdgroups, which is more complex, but I'm interested in exploring it.
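To illustrate why the SIMD-group width matters, here is a minimal CPU-side sketch in Python. It is not the ToshLLM kernel and not Metal code. It only mimics a shuffle-down style sum across one SIMD group (similar in spirit to what `simd_shuffle_down` is used for in Metal Shading Language). The function name `simd_reduce_sum` and the treatment of out-of-range lanes as 0 are assumptions made for this illustration.

```python
def simd_reduce_sum(lanes, assumed_width):
    """Sum the values of one SIMD group with a shuffle-down tree.

    lanes         -- one value per hardware lane (64 on GCN, 32 on RDNA)
    assumed_width -- the width the kernel author hard-coded
    Out-of-range reads are treated as 0 here, which is a simplification.
    """
    vals = list(lanes)
    n = len(vals)
    offset = assumed_width // 2
    while offset > 0:
        # Every lane i adds the value currently held by lane i + offset.
        vals = [vals[i] + (vals[i + offset] if i + offset < n else 0)
                for i in range(n)]
        offset //= 2
    # By convention the result is read from lane 0.
    return vals[0]


# A 64-lane group, as probed on GCN/Polaris hardware.
hardware_lanes = list(range(64))

wrong = simd_reduce_sum(hardware_lanes, assumed_width=32)  # kernel written for RDNA
right = simd_reduce_sum(hardware_lanes, assumed_width=64)  # kernel aware of wave64

print(wrong, right, sum(hardware_lanes))
```

With a hard-coded width of 32, lane 0 only ever sees lanes 0 to 31, so half of the group's values are silently dropped. That kind of partial result is one plausible way a 32-lane kernel could produce garbage output on a 64-lane GPU.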

 

I'll also study an old llama-metal repo that I found while searching for this issue, to see if I can port it, apply my patch to it, and optimize it better for GCN GPUs.

 

I'll update if I make progress on GCN support... would you be willing to test it when I get a working solution?

Edited by engeldlgado
Posted (edited)

Hi @Cyberdevs

Quick update on what I've worked on today:

Update (v0.81.25): you can now attach files in chat — including PDFs (text is extracted automatically, and scanned PDFs are read with on-device OCR), plus more text formats. And image input for vision models is in: drop in a vision model with its mmproj (e.g. gemma-3-4b) and you can attach an image and ask about it. Vision is experimental and the image encoder runs partly on CPU on AMD GPUs (some Metal ops aren't supported), so it works but isn't fully GPU-accelerated yet. DMG is building now.

 

I've also added an option to change the default location for models.

image.thumb.png.7c891159ecdf92a80033fe3b3d6b6162.png


Also, may I ask you for a new test on the RX card? Update the app, then just load a model and start the server. No benchmark or anything else is needed; just send me the logs. I'm researching the Vega/GCN cards.

Edited by engeldlgado
1 hour ago, Alpha22 said:

Settings?

 
Yeah, in Settings there is an option to change the Inference Engine (llama.cpp) bundle. The experimental one has improvements over the normal one.

image.thumb.png.20b24a1d47c3335ba38ebe6be2975f44.png

On my AMD RX6800XT

The top two benchmarks were taken after enabling these settings in version 0.81.26:

2 hours ago, engeldlgado said:

Yeah, in Settings there is an option to change the Inference Engine (llama.cpp) bundle. The experimental one has improvements over the normal one.

01.png

 

I'll test my RX580 later and post the results.

Posted (edited)

@Cyberdevs

 

That's really great performance. I've been updating the app with new improvements. I'm more active on Reddit, but I'm still working on the issues reported here. I have a list of bugs to solve and things to improve, and I'm still checking this forum for new reports.

Edited by engeldlgado
5 hours ago, engeldlgado said:

@Cyberdevs

 

That's really great performance. I've been updating the app with new improvements. I'm more active on Reddit, but I'm still working on the issues reported here. I have a list of bugs to solve and things to improve, and I'm still checking this forum for new reports.

The new versions v.81.29 and v.81.30 run slowly on my hackintoshes: the X299 with an RX-580 and the Z690 with an RX-570. Even the H97M-E with an RX-560 runs slowly in the benchmarks.

I'll test again with the RX-6600XT and hope it will be much better!

Screenshot 2026-06-20 at 15.01.20.png

Edited by jsl

It will be much better than with an RX 5x0!

My RX 560 only gets 1,3 / 52, so even less than the RX 580. The RX 5600XT is several times better. There is only a minimal difference between app versions!

That's normal for that GPU type: it is much too old for modern GPU compute or AI.

 

 

Posted (edited)
8 hours ago, jsl said:

The new versions v.81.29 and v.81.30 run slowly on my hackintoshes: the X299 with an RX-580 and the Z690 with an RX-570. Even the H97M-E with an RX-560 runs slowly in the benchmarks.

I'll test again with the RX-6600XT and hope it will be much better!

Screenshot 2026-06-20 at 15.01.20.png



 

Thanks for the update! It was to be expected that it would be slow. I just wanted to verify if it was possible to get coherent text instead of garbage output.

Now that it's confirmed, I can look deeper into the fix. Keep an eye out for updates!

Follow the issue for RX-500 cards here... https://github.com/engeldlgado/toshllm/issues/1

The RX 6600 will be much faster, trust me!

Edited by engeldlgado

Hello everyone, how has the experimental TurboQuant engine been working for you compared to the Bundle? I'm planning to keep just a single engine for maintainability, since managing both takes up too much time... if a bug occurs in one, I have to fix it in both. Honestly, in my testing, both have been stable. But it all comes down to what you say, as this is for the community, not just for me... so... what do you think?

Great to see you working hard on new versions.

 

I tested again on my AMD RX 560 (the mobile RX 560 in a MacBook Pro).

I've never had problems since the first version, and I'm now on v .42.

I added the extra option text for RX 560/580 that you recommended.

1,5 / 50.2

I'll also run the benchmark later today with my RX 5600XT and v .42.

 

ToshLLM benchmark · 2026-06-27T08:05:29Z ===
model:  Qwen3-4B-Q4_K_M.gguf
GPU:    AMD Radeon Pro 560X
engine: bundled · K:f16 V:f16
FA:     AMD Flash Attention (GPU)
args:   -m /Users/andreas/models/Qwen3-4B-Q4_K_M.gguf -ngl 99 --mmap 0 -r 2 -fa 1
=========================
ggml_metal_device_init: probed SIMD-group width = 64
ggml_metal: probed SIMD-group width = 64 (32 = Apple/AMD RDNA, 64 = AMD GCN/Vega)
ggml_metal: wave64 safe mode ON (SIMD width 64, non-Apple GPU): simdgroup fast paths disabled for correct output
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded precompiled '/Applications/ToshLLM.app/Contents/Resources/bin/default.metallib'
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0 (AMD Radeon Pro 560X)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: simdgroup reduction   = false
ggml_metal_device_init: simdgroup matrix mul. = false
ggml_metal_device_init: has unified memory    = false
ggml_metal_device_init: has bfloat            = false
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = false
ggml_metal_device_init: recommendedMaxWorkingSetSize  =  4294.97 MB
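For anyone comparing logs from several cards, the lines above can be summarized with a small script. This is only a rough sketch in Python that assumes the log format stays exactly as pasted here. The function name `summarize_metal_log` and the shape of the returned dictionary are made up for illustration and are not part of ToshLLM or llama.cpp.

```python
import re


def summarize_metal_log(log: str) -> dict:
    """Extract the probed SIMD-group width and the true/false device flags."""
    summary = {"simd_width": None, "flags": {}}
    for line in log.splitlines():
        # e.g. "ggml_metal_device_init: probed SIMD-group width = 64"
        m = re.search(r"probed SIMD-group width = (\d+)", line)
        if m:
            summary["simd_width"] = int(m.group(1))
            continue
        # e.g. "ggml_metal_device_init: simdgroup reduction   = false"
        m = re.match(
            r"ggml_metal_device_init:\s*(.+?)\s*=\s*(true|false)\s*$", line
        )
        if m:
            summary["flags"][m.group(1)] = (m.group(2) == "true")
    return summary


# A few lines copied from the log above.
sample = """\
ggml_metal_device_init: probed SIMD-group width = 64
ggml_metal_device_init: GPU name:   MTL0 (AMD Radeon Pro 560X)
ggml_metal_device_init: simdgroup reduction   = false
ggml_metal_device_init: simdgroup matrix mul. = false
ggml_metal_device_init: has unified memory    = false
ggml_metal_device_init: use residency sets    = true
"""

info = summarize_metal_log(sample)
print(info["simd_width"], info["flags"])
```

A width of 64 together with `simdgroup reduction = false` matches what the log itself reports for this card: wave64 safe mode with the simdgroup fast paths disabled.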

 

12 hours ago, mitch_de said:

Great to see you working hard on new versions.

 

I tested again on my AMD RX 560 (the mobile RX 560 in a MacBook Pro).

I've never had problems since the first version, and I'm now on v .42.

I added the extra option text for RX 560/580 that you recommended.

1,5 / 50.2

I'll also run the benchmark later today with my RX 5600XT and v .42.

 



Hey @mitch_de, try testing with the custom build linked in this issue: https://github.com/engeldlgado/toshllm/issues/1#issuecomment-4775184370

There is a custom build I'm testing for the RX-500 series, so it may work better. Also, the project will be delayed for a few days because there was an earthquake in my city a few days ago. Fortunately, my family and friends are all safe, but internet connections were severely affected and part of my home was damaged. I hope to get fully back to the project soon. Best regards.

Hi,

I tested this NOAVX version on the MacBook Pro with the RX 560.

Both models, the small 4B one and the Llama 8B one, failed, and only after a much longer run than with the v .42 version. None of the normal versions ever failed.

qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | MTL,BLAS   |       6 |   1 |    0 |           pp512 |         71.90 ± 0.25 |

test_gen: failed to decode generation batch, res = -3

llama_bench: error: failed to run gen
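If you want to collect rows like the one above from several runs, a small parser helps. This is a minimal sketch in Python that assumes the row is pipe-delimited, that the last two cells are the test name and a "mean ± deviation" throughput value, and that the first cell is the model. The function name `parse_bench_row` is hypothetical, and the middle columns are left uninterpreted because the header row was not posted.

```python
def parse_bench_row(row: str) -> dict:
    """Split one pipe-delimited benchmark row into a few named fields."""
    # Drop the outer pipes, then split the remaining cells.
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    mean_text, _, std_text = cells[-1].partition("±")
    return {
        "model": cells[0],
        "test": cells[-2],
        "mean": float(mean_text),
        # The deviation is optional in case a row omits it.
        "std": float(std_text) if std_text.strip() else None,
    }


# The row posted above, copied verbatim.
row = ("qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | MTL,BLAS   "
       "|       6 |   1 |    0 |           pp512 |         71.90 ± 0.25 |")

result = parse_bench_row(row)
print(result)
```

Note that only the pp512 row made it into the output here; the generation test aborted with `res = -3` before it could print a row of its own.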

 

For me it's no problem if your app really needs at least an RX 5600XT. Bought used, they are no longer expensive.

The fact that non-AVX2 CPUs fail, or that the RX 560/RX 580 have problems, should not give you sleepless nights! Even if they ran, I think they would be uselessly slow for this kind of AI GPU task.

 

I will upload the missing RX 5600 XT test with both the NOAVX build and the normal v .42 version later.
