
Metal Particles (as demo /bench) new Nbody-Metal (demo/bench)


mitch_de

144 posts in this topic


I made some changes to the particle demo. Would you like me to send you the diffs?

 

- Use MTLStorageModeManaged for all buffers.

- Render to a texture directly, instead of the drawable.

- Every N frames (I used 32), copy the texture to the drawable so the results are visible.

- Every M frames (I used 8), update the FPS display.

 

I bumped the particle count up to 16M and, as you can see, my GeForce GTX 980 handles that without breaking a sweat (around 180 FPS). Because the demo now renders to a texture instead of the current drawable, the animation isn't capped by the Core Animation 60 FPS limit.
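
The frame cadence in the list above (render offscreen every frame, copy to the drawable every N frames, refresh the FPS display every M frames) can be sketched roughly like this. It is only a sketch of the scheduling logic in Python, not the actual Swift/Metal code from the demo; the names `frame_actions`, `count_actions`, `PRESENT_EVERY` and `FPS_EVERY` are made up for illustration.

```python
PRESENT_EVERY = 32  # N: copy the offscreen texture to the drawable (value from the post)
FPS_EVERY = 8       # M: refresh the FPS display (value from the post)


def frame_actions(frame_index, present_every=PRESENT_EVERY, fps_every=FPS_EVERY):
    """Return the actions to perform for one frame (hypothetical action names)."""
    actions = ["render_to_texture"]  # every frame renders to the offscreen texture
    if frame_index % fps_every == 0:
        actions.append("update_fps_display")
    if frame_index % present_every == 0:
        actions.append("copy_texture_to_drawable")
    return actions


def count_actions(num_frames):
    """Count how often each action fires over num_frames consecutive frames."""
    counts = {}
    for i in range(num_frames):
        for action in frame_actions(i):
            counts[action] = counts.get(action, 0) + 1
    return counts
```

With N = 32 and roughly 180 rendered frames per second, the visible image would be refreshed only about 5 to 6 times per second (180 / 32), which is the trade-off for not waiting on the drawable every frame.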

 

 

post-1204197-0-32506100-1447625005_thumb.png

Link to comment
Share on other sites


Diff to what source?

I used this source for my small changes and tests: https://github.com/FlexMonkey/MetalKit-Particles

Link to comment
Share on other sites


Is it available somewhere for DL?

Link to comment
Share on other sites

Yep. To be more exact, I meant that the AMD GPU (shown by Ciro82) has problems with both of those Metal demos, Particles and Nbody.

That doesn't mean that Metal in general has problems with AMD.

The author of both source codes only has an NVIDIA MacBook GPU, so perhaps there are some bugs for AMD GPUs in the code.

At least NVIDIA works with both demos without showing weird colours (Particles) or far fewer stars (Nbody).

Link to comment
Share on other sites

AMD 6670

Process:               OSXMetalParticles [796]
Path:                  /Users/USER/Downloads/OSXMetalParticles_2Mill_final.app/Contents/MacOS/OSXMetalParticles
Identifier:            uk.co.flexmonkey.OSXMetalParticles
Version:               1.0 (1)
Code Type:             X86-64 (Native)
Parent Process:        ??? [1]
Responsible:           OSXMetalParticles [796]
User ID:               501

Date/Time:             2015-11-18 22:42:09.613 +0300
OS Version:            Mac OS X 10.11.2 (15C47a)
Report Version:        11
Anonymous UUID:        43065937-BBF0-8F2F-C339-5635BC71CE03


Time Awake Since Boot: 4200 seconds

System Integrity Protection: disabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes:       0x0000000000000001, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   uk.co.flexmonkey.OSXMetalParticles	0x0000000101395c68 _TFC17OSXMetalParticles11ParticleLabcfMS0_FT5widthSu6heightSu12numParticlesOS_13ParticleCount_S0_ + 1880
1   uk.co.flexmonkey.OSXMetalParticles	0x00000001013923d8 _TFC17OSXMetalParticles18GameViewController11viewDidLoadfS0_FT_T_ + 120
2   uk.co.flexmonkey.OSXMetalParticles	0x0000000101392776 _TToFC17OSXMetalParticles18GameViewController11viewDidLoadfS0_FT_T_ + 22
3   com.apple.AppKit              	0x00007fff8aee45bc -[NSViewController _sendViewDidLoad] + 97
4   com.apple.CoreFoundation      	0x00007fff889cb33f -[NSSet makeObjectsPerformSelector:] + 223
5   com.apple.AppKit              	0x00007fff8ad93eb2 -[NSIBObjectData nibInstantiateWithOwner:options:topLevelObjects:] + 1142

Link to comment
Share on other sites


My 290X had no issues with any of the tests, plus I work with Metal in UE4 all the time.

Link to comment
Share on other sites

Then the AMD problem with those demos is the same as the OpenCL problems: it depends on the GPU model. Some work, some don't.

And as I said, that doesn't mean Metal has a general problem with AMD. As with OpenCL, more complex code may fail on some GPUs even though OpenCL works in general.

Link to comment
Share on other sites

Yeah let me clean some things up and I'll package it up ASAP.  I'm going to reach out to the original author and see if he'll merge my changes into the original version as well.

 

Would be great.

 

In the meantime, here is Nbody on an R9 280X (a crappy MSI one) in an original, old 2006 Mac Pro, at every available body count.

post-1130320-0-20571300-1447970417_thumb.png

post-1130320-0-89388600-1447970438_thumb.png

post-1130320-0-48808500-1447970445_thumb.png

post-1130320-0-97326200-1447970454_thumb.png

post-1130320-0-29076200-1447970462_thumb.png

Link to comment
Share on other sites

Nbody (by NVIDIA) - CUDA only.

 

Very interesting options (command line parameters)!

 

-fp64             (use double precision floating point values for simulation)
-hostmem          (stores simulation data in host memory)
-benchmark        (run benchmark to measure performance) - benchmark results (GFLOPS) are reported without showing the nbody window, which is more valid than a windowed run with heavy OpenGL load and higher CPU usage alongside the CUDA compute tasks
-numbodies=<N>    (number of bodies (>= 1) to run in simulation)
-device=<d>       (where d=0,1,2.... for the CUDA device to use) - if you have more than one CUDA device, you can select which device is benchmarked
-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation) - if you have more than one CUDA device, you can use ALL CUDA devices together
-cpu              (run n-body simulation on the CPU) - CPU GFLOPS
 
nbody -numbodies=32768 -benchmark
> Compute 3.0 CUDA device: [GeForce GT 740]

number of bodies = 32768

32768 bodies, total time for 10 iterations: 645.157 ms

= 16.643 billion interactions per second

= 332.862 single-precision GFLOP/s at 20 flops per interaction
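
The figures in that output can be reproduced with a bit of arithmetic. The sketch below assumes that every body interacts with every body (N x N interactions per iteration) and uses the 20 flops per interaction stated in the output; the function name `nbody_throughput` is made up for illustration and is not part of the CUDA sample.

```python
def nbody_throughput(num_bodies, iterations, total_ms, flops_per_interaction=20):
    """Return (billions of interactions per second, GFLOP/s) for a benchmark run.

    Assumes an all-pairs simulation: num_bodies * num_bodies interactions
    per iteration, which is consistent with the numbers printed above.
    """
    interactions = num_bodies * num_bodies * iterations
    seconds = total_ms / 1000.0
    interactions_per_second = interactions / seconds
    gflops = interactions_per_second * flops_per_interaction / 1e9
    return interactions_per_second / 1e9, gflops


# Values taken from the GT 740 run above: 32768 bodies, 10 iterations, 645.157 ms.
billions, gflops = nbody_throughput(32768, 10, 645.157)
print(f"{billions:.3f} billion interactions per second")
print(f"{gflops:.3f} single-precision GFLOP/s")
```

So the GFLOP/s figure is simply derived from the measured time for 10 iterations; a faster GPU shows a shorter total time and a proportionally higher number.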

 
With the window, which has no 60 FPS limit: a GTX 6xx/9xx and/or 2 CUDA GPUs (using -numdevices=2) will show that ;)
nbody -numbodies=32768
post-110586-0-39140400-1448015710_thumb.jpg
 
Link to comment
Share on other sites

 

 

 

 

hello :)

 

Does it work on Yosemite?

 

thanks

Link to comment
Share on other sites

> Compute 3.5 CUDA device: [GeForce GTX 780]

number of bodies = 32768

32768 bodies, total time for 10 iterations: 120.707 ms

= 88.954 billion interactions per second

= 1779.087 single-precision GFLOP/s at 20 flops per interaction

 

9zvxvc.png

 

^_^

Link to comment
Share on other sites

:)

 

Uh?

Limited to 6144 bodies?

 

no screen :)

 

Last login: Fri Nov 20 23:22:36 on ttys000

Mac-Pro-de-gils:~ gils$ /Users/gils/Downloads/Nbody\ CUDA\ only nbody -numbodies=32768

-bash: /Users/gils/Downloads/Nbody CUDA only: is a directory

Mac-Pro-de-gils:~ gils$ /Users/gils/Downloads/Nbody\ CUDA\ only/nbody nbody -numbodies=32768 -benchmark

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.

-fullscreen       (run n-body simulation in fullscreen mode)

-fp64             (use double precision floating point values for simulation)

-hostmem          (stores simulation data in host memory)

-benchmark        (run benchmark to measure performance) 

-numbodies=<N>    (number of bodies (>= 1) to run in simulation) 

-device=<d>       (where d=0,1,2.... for the CUDA device to use)

-numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)

-compare          (compares simulation results running once on the default GPU and once on the CPU)

-cpu              (run n-body simulation on the CPU)

-tipsy=<file.bin> (load a tipsy model file for simulation)

 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

 

> Windowed mode

> Simulation data stored in video memory

> Single precision floating point simulation

> 1 Devices used for simulation

GPU Device 0: "Graphics Device" with compute capability 5.2

 

> Compute 5.2 CUDA device: [Graphics Device]

number of bodies = 32768

32768 bodies, total time for 10 iterations: 234.380 ms

= 45.812 billion interactions per second

= 916.239 single-precision GFLOP/s at 20 flops per interaction

Mac-Pro-de-gils:~ gils$ 

 

 
 
916 GFLOP/s ?? 

post-1093405-0-38751100-1448058534_thumb.png

Link to comment
Share on other sites

The difference in GFLOPS between -benchmark (no OpenGL work for the GPU and CPU) and windowed mode: rendering with OpenGL lowers the GFLOPS because the GPU has a lot of OpenGL work to do besides the compute work.

Nbody's main purpose is to benchmark the GPU's compute power, so -benchmark (no OpenGL output) reflects compute performance much better.

"GPU Device 0: "Graphics Device" with compute capability 5.2"

Your GTX 950?

Link to comment
Share on other sites
