
Metal Particles (as demo/bench), new Nbody-Metal (demo/bench)


mitch_de

144 posts in this topic

Yep, the faster the GPU can compute (more and faster compute units), the higher you can set numbodies and the more GFLOPS you get, because fast GPUs are not under full load with "only" 32K bodies.

Low-end or midrange GPUs will not get higher GFLOPS by using more than 32K bodies.

The same holds for all GPUs: more numbodies = less FPS. :) I think if you get more than 10 FPS with 32K, you can try to bench with 64K bodies and see whether you get more GFLOPS.

Even 128K bodies should work (I don't know for sure) on very fast GPUs.

EDIT: Yep, 128K also works on my low-end GT 740. Same GFLOPS as with 64K, around 332 GFLOPS, but only 1.0 FPS, and with 256K only 0.2 FPS ;)
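To see why FPS drops so quickly as numbodies grows, here is a rough sketch. It assumes an all-pairs simulation (the work per step grows with the square of the body count), 20 flops per body-body interaction, and one simulation step per rendered frame. None of these figures come from the posts above; they are assumptions for illustration, and the function names are made up.

```python
# Assumption: 20 floating-point operations per body-body interaction
# (single precision). This figure is not stated anywhere in the thread.
FLOPS_PER_INTERACTION = 20


def gflop_per_step(num_bodies):
    """Work for one all-pairs simulation step, in GFLOP (N * N interactions)."""
    return num_bodies * num_bodies * FLOPS_PER_INTERACTION / 1e9


def steps_per_second(num_bodies, sustained_gflops):
    """Steps per second a GPU could reach at a given sustained GFLOPS rate."""
    return sustained_gflops / gflop_per_step(num_bodies)


if __name__ == "__main__":
    for k in (32, 64, 128, 256):
        n = k * 1024
        print(f"{k:>3}K bodies: {gflop_per_step(n):8.1f} GFLOP/step, "
              f"{steps_per_second(n, 332.0):6.2f} steps/s at 332 GFLOPS")
```

Under these assumptions, a GPU sustaining about 332 GFLOPS would manage roughly one step per second at 128K bodies and only a fraction of a step at 256K, which is in line with the 1.0 and 0.2 FPS reported for the GT 740.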

 

So, if you have any GPU which is not low-end (GT 2/4/5/610, ..20, ..30), better start with 64K bodies to get close to the max GFLOPS.

 

32768 = 32K

65536 = 64K

131072 = 128K

262144 = 256K (maybe usable on high-end GPUs like the GTX 960+); my GT 740 GPU slows/stalls the whole OS X GUI when running 256K bodies.

Link to comment
Share on other sites

Yep, nbody CUDA can use more than one GPU by adding the numdevices= parameter (2, 3, 4, ...).

Great to see the first 2+ GPU nbody result, getting 2400 GFLOPS.

Try using -benchmark to compare GFLOPS without any CPU/GPU work for OpenGL rendering.

Very fast GPUs didn't show much difference, at least when running 64K+ bodies. Low-end GPUs or low-end CPUs will show differences, because the combined OpenGL / GPU compute task slows down the GFLOPS for GPU computing.

Also, older high-end GPUs (Fermi, Kepler), which are even faster in OpenGL than newer midrange Kepler (vs Fermi) / Maxwell (vs Fermi, Kepler) GPUs, are often much slower in GPU computing (OpenCL, CUDA).

My GT 740 (Kepler) DDR3, for example, is only 5-10% faster in OpenGL than my older GT 440 DDR5 (Fermi) GPU.

But it is much faster in CUDA and OpenCL: up to 2 times faster, on average 30% faster.
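For anyone who wants to script a few runs, here is a small sketch that builds the command line from a body count given in K. The binary path ./nbody is an assumption (adjust it to wherever your build lives), the helper name is made up, and the flag spellings follow the numbodies, numdevices= and -benchmark options mentioned in this thread, so check them against your version of the sample.

```python
def nbody_command(k_bodies, num_devices=1, benchmark=True, binary="./nbody"):
    """Build an argument list for an nbody run.

    k_bodies is the body count in K (64 -> 65536 bodies).
    binary is a hypothetical path to the nbody executable.
    """
    args = [binary, f"-numbodies={k_bodies * 1024}"]
    if num_devices > 1:
        # Only pass numdevices when more than one GPU should be used.
        args.append(f"-numdevices={num_devices}")
    if benchmark:
        # Benchmark mode: no OpenGL rendering work mixed into the result.
        args.append("-benchmark")
    return args


if __name__ == "__main__":
    for k in (64, 128):
        print(" ".join(nbody_command(k, num_devices=2)))
```

The list can be handed to subprocess.run as-is once the path points at a real binary.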

post-1181448-0-01311200-1448280878_thumb.png

post-1181448-0-23008700-1448280890_thumb.png

post-1181448-0-52704400-1448280907_thumb.png

Link to comment
Share on other sites

Use at least 64K numbodies. Otherwise, e.g. with 16K or 8K on 2 CUDA devices, the GPUs will not get a full workload: 16K gives 1900 GFLOPS vs 2400 with 64K, even using OpenGL.

In -benchmark mode, 64K (or 128K) may give the same or slightly higher GFLOPS than the 2400 GFLOPS of the 64K non-benchmark (OpenGL window) run.

 

65536 = 64K

131072 = 128K

 

Less than 64K bodies (like 32K ... 8K) may only fully load low-end GPUs!

With less than 64K on a midrange+ GPU, it is more an OpenGL bench than a GPU compute bench.

Reduced FPS from more bodies doesn't matter (in non-benchmark, OpenGL runs), because Nbody CUDA is a GPU compute bench.

Running very few bodies, like 2K or 8K, is a 90% CPU+OpenGL bench (GFLOPS only 1/3 to 1/2 of max GFLOPS); 64K+ is a 90% GPU compute bench, and in -benchmark mode 95%.

And the focus is only on the GFLOPS, not the OpenGL FPS.



 


post-1181448-0-00636500-1448282351_thumb.png

post-1181448-0-32740200-1448282366_thumb.png

post-1181448-0-34107200-1448282431_thumb.png

post-1181448-0-86478400-1448282440_thumb.png

post-1181448-0-27249100-1448282504_thumb.png

post-1181448-0-78237400-1448282514_thumb.png

Link to comment
Share on other sites

Yep, and don't worry about the different GFLOPS shown in CUDA-Z (open source) vs Nbody CUDA (by Nvidia).

Different compute code (Nbody is much more complex), different GFLOPS.

It seems that Nbody CUDA (from Nvidia) benefits more from the modern Maxwell GPU vs the Kepler GPU than CUDA-Z does:

Nbody: 1446 / 1150 GFLOPS = Maxwell GTX 960 is about 1.26 times faster than Kepler GTX 660 Ti

CUDA-Z: 2709 / 2312 GFLOPS = Maxwell GTX 960 is "only" 1.17 times faster than Kepler GTX 660 Ti
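The two ratios are quick to recompute from the four GFLOPS figures quoted above. The dictionary layout and the function name are just for illustration, and the results are rounded to two decimals.

```python
# GFLOPS figures as quoted in this post: (GTX 960, GTX 660 Ti).
RESULTS = {
    "Nbody CUDA": (1446, 1150),
    "CUDA-Z": (2709, 2312),
}


def speedup(newer_gflops, older_gflops):
    """How many times faster the newer GPU is, rounded to two decimals."""
    return round(newer_gflops / older_gflops, 2)


if __name__ == "__main__":
    for bench, (gtx_960, gtx_660_ti) in RESULTS.items():
        print(f"{bench}: GTX 960 is {speedup(gtx_960, gtx_660_ti)}x the GTX 660 Ti")
```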

Link to comment
Share on other sites
