
CUDA-Z Info+Bench (Nvidia only) - updated Dec 2015


mitch_de

86 posts in this topic


  • 2 weeks later...

Hi all. My GTX 295 Co-op edition is up and running in Snow Leopard 10.6.8 using Apple's native 10.6.8 drivers and OpenGL framework. The card has 2 GPUs, so running on both roughly doubles the results.

I added a post to a thread on insanelymac with a solution for GTX 295 owners who want to update to 10.6.8:

 

http://www.insanelymac.com/forum/index.php...p;#entry1716804

 

Basically, there are changes made in 10.6.8 which render the existing EFI string useless unless 2 extensions are installed in /Extra/Extensions.

 

The following is the result of combining both GPUs on the GTX 295 card for the N-Body demo in CUDA 4.0.

 

./nbody -numdevices=2 -n=61440 -benchmark

[nbody] starting...
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64	   (use double precision floating point values for simulation)
-numdevices=N (use first N CUDA devices for simulation)

> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
> Compute 1.3 CUDA device: [GeForce GTX 295]
> Compute 1.3 CUDA device: [GeForce GTX 295]

61440 bodies, total time for 10 iterations: 697.011 ms
= 54.158 billion interactions per second
= 1083.161 single-precision GFLOP/s at 20 flops per interaction

[nbody] test results...
PASSED
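
For anyone who wants to check how the benchmark figures relate to each other, here is a minimal sketch. It assumes nbody counts n * n pairwise interactions per iteration (the numbers above are consistent with that), and the function names are made up for illustration:

```python
# Sketch: reproduce nbody's reported throughput from body count and elapsed time.
# Assumption: the benchmark counts n * n pairwise interactions per iteration.

FLOPS_PER_INTERACTION = 20  # figure quoted in the nbody output


def interactions_per_second(num_bodies, iterations, total_ms):
    """Pairwise interactions evaluated per second."""
    return num_bodies * num_bodies * iterations / (total_ms / 1000.0)


def gflops(num_bodies, iterations, total_ms):
    """GFLOP/s at 20 flops per interaction."""
    per_second = interactions_per_second(num_bodies, iterations, total_ms)
    return per_second * FLOPS_PER_INTERACTION / 1e9


if __name__ == "__main__":
    # Values from the overclocked GTX 295 run above.
    print(round(interactions_per_second(61440, 10, 697.011) / 1e9, 3))
    print(round(gflops(61440, 10, 697.011), 3))
```

Fed with 61440 bodies, 10 iterations and 697.011 ms, this should land within rounding of the interactions-per-second and GFLOP/s values the benchmark printed.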



Great nbody result!

 

Here is my 9600 GT:

 

nbody -n=61440 -benchmark

[nbody] starting...
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-numdevices=N (use first N CUDA devices for simulation)

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Compute 1.1 CUDA device: [GeForce 9600 GT]

61440 bodies, total time for 10 iterations: 4812.126 ms
= 7.845 billion interactions per second
= 156.890 single-precision GFLOP/s at 20 flops per interaction

[nbody] test results...
PASSED

GA_EP35:nbody ami$

 

 

For others: compiled nbody attached!

Use nbody -n=61440 -benchmark so we can compare results.

Without the -benchmark parameter you get a window that shows the simulation running.

I added the nbody results and the nbody binary to the first post.

Happy nbody benching, which is much more useful than the small (bandwidth test) CUDA-Z performance values.

nbody.zip


GTX 460 results:

[nbody] starting...
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-numdevices=N (use first N CUDA devices for simulation)

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Compute 2.1 CUDA device: [GeForce GTX 460]

61440 bodies, total time for 10 iterations: 2008.823 ms
= 18.791 billion interactions per second
= 375.829 single-precision GFLOP/s at 20 flops per interaction

[nbody] test results...
PASSED


My EVGA GTX 295 Co-op edition is overclocked and has been running that way for several years now. Those settings have allowed it to match a GTX 285 (times 2) in CUDA performance. I'm going to restore the 2 ROMs (one per GPU) to the factory settings and report the N-body results for the standard clocks.

With the factory clocks, the N-body results for both GPUs running simultaneously (via -numdevices=2) are:

 

./nbody -n=61440 -numdevices=2 -benchmark
[nbody] starting...
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64	   (use double precision floating point values for simulation)
-numdevices=N (use first N CUDA devices for simulation)

> Windowed mode
> Simulation data stored in system memory
> Single precision floating point simulation
> 2 Devices used for simulation
> Compute 1.3 CUDA device: [GeForce GTX 295]
> Compute 1.3 CUDA device: [GeForce GTX 295]
61440 bodies, total time for 10 iterations: 847.679 ms
= 44.532 billion interactions per second
= 890.638 single-precision GFLOP/s at 20 flops per interaction
[nbody] test results...
PASSED

 

And the N-body result for one GPU on the GTX 295:

 

./nbody -n=61440 -benchmark
[nbody] starting...
Run "nbody -benchmark [-n=<numBodies>]" to measure perfomance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64	   (use double precision floating point values for simulation)
-numdevices=N (use first N CUDA devices for simulation)

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Compute 1.3 CUDA device: [GeForce GTX 295]
61440 bodies, total time for 10 iterations: 1687.764 ms
= 22.366 billion interactions per second
= 447.322 single-precision GFLOP/s at 20 flops per interaction
[nbody] test results...
PASSED
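
To put the one-GPU and two-GPU runs side by side, here is a rough sketch that computes multi-GPU scaling efficiency and the gain from the overclock. The helper names are hypothetical; the inputs are the GFLOP/s figures posted in this thread:

```python
# Sketch: compare posted GFLOP/s figures. Helper names are illustrative only.


def scaling_efficiency(multi_gpu_gflops, single_gpu_gflops, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return multi_gpu_gflops / (single_gpu_gflops * num_gpus)


def speedup(new_gflops, base_gflops):
    """How many times faster the new figure is than the baseline."""
    return new_gflops / base_gflops


if __name__ == "__main__":
    # GTX 295 at factory clocks: two GPUs versus one GPU.
    print(scaling_efficiency(890.638, 447.322, 2))
    # Overclocked two-GPU run versus factory-clock two-GPU run.
    print(speedup(1083.161, 890.638))
```

With these numbers the two GPUs scale almost linearly, and the overclock is worth a bit over 20% in this benchmark.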


10.6.7 : CUDA 4.0.17 : Driver 256.02.05.f1

 

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
> Compute 2.0 CUDA device: [GeForce GTX 470]
61440 bodies, total time for 10 iterations: 1393.516 ms
= 27.089 billion interactions per second
= 541.777 single-precision GFLOP/s at 20 flops per interaction
[nbody] test results...
PASSED

 

I will soon test 10.6.8 or 10.7 with updated drivers.

 

TY


One can even benchmark double precision with the N-Body CUDA demo. GTX 4xx owners should see a much larger performance increase compared with GTX 2xx cards.

 

Here are my results:

 

./nbody -fp64 -n=30720 -benchmark

 

GTX 295 OC'ed:

 

-- One GPU: 66.966 double-precision GFLOP/s

-- Two GPUs via -numdevices=2: 128.672 double-precision GFLOP/s

 

GTX 295 (standard clocks):

 

-- One GPU: 55.010 double-precision GFLOP/s

-- Two GPUs via -numdevices=2: 106.478 double-precision GFLOP/s

 

This is where a GTX 4xx can shine over a GTX 2xx variant.

 

Try -n=15360 if -n=30720 reports unspecified launch failure.

Not all cards support double-precision.


Lion will be out soon, and folks will be able to benchmark the GTX 5xx series.

The following is taken from my Windows 7 box running a pair of GTX 560s. They came factory OC'd at 900/1800/2004 (4008) at 1.012 volts. However, I under-clocked and under-volted them to 855/1710/2100 (4200) at 0.987 volts.

 

Single-Precision: ./nbody -n=61440 -benchmark

 

-- One GPU: 548.804 single-precision GFLOP/s

-- Two GPUs via -numdevices=2: 1068.541 single-precision GFLOP/s

 

Double-Precision: ./nbody -fp64 -n=30720 -benchmark

 

-- One GPU: 89.702 double-precision GFLOP/s

-- Two GPUs via -numdevices=2: 166.125 double-precision GFLOP/s
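
One way to read the single- and double-precision figures together is the ratio between them. The sketch below uses the one-GPU numbers posted in this thread for the GTX 560 and for the GTX 295 at standard clocks. Note the runs used different body counts (-n=61440 for single precision, -n=30720 for double), so treat the ratio as a rough indication only. The function name is hypothetical:

```python
# Sketch: double-precision throughput as a fraction of single-precision throughput.
# Caveat: the posted fp32 and fp64 runs used different -n values, so this is approximate.


def fp64_to_fp32_ratio(fp64_gflops, fp32_gflops):
    """Share of single-precision throughput retained in double precision."""
    return fp64_gflops / fp32_gflops


if __name__ == "__main__":
    # One GPU each, figures as posted in this thread.
    print("GTX 295 (standard clocks):", fp64_to_fp32_ratio(55.010, 447.322))
    print("GTX 560:", fp64_to_fp32_ratio(89.702, 548.804))
```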

 

 

I hope NVIDIA will one day make a single-PCB card with 2 GTX 560 GPUs, which would strike a good balance between compute power and power consumption.


  • 5 months later...

When I run nbody I get the following:

dyld: Library not loaded: @rpath/libcudart.dylib
 Referenced from: /Users/lord_jeremy/Applications/./nbody
 Reason: image not found

I've got the NVIDIA CUDA package 4.1.25 installed, and CUDA-Z shows benchmark info, so I presume it's functioning correctly. Any thoughts?
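
A possible workaround for the dyld error, sketched under assumptions: the message means the loader could not find libcudart.dylib through the binary's @rpath, and pointing DYLD_LIBRARY_PATH at the CUDA library directory may let it resolve. The path /usr/local/cuda/lib is the usual toolkit location on OS X but is an assumption here, and the helper name is made up:

```python
# Sketch: build an environment that adds the CUDA library directory to
# DYLD_LIBRARY_PATH before launching nbody.
# Assumption: the toolkit is installed under /usr/local/cuda; adjust if not.
import os

CUDA_LIB_DIR = "/usr/local/cuda/lib"


def env_with_cuda_libs(base_env, cuda_lib_dir=CUDA_LIB_DIR):
    """Return a copy of base_env with cuda_lib_dir first in DYLD_LIBRARY_PATH."""
    env = dict(base_env)
    parts = [p for p in env.get("DYLD_LIBRARY_PATH", "").split(":") if p]
    if cuda_lib_dir not in parts:
        parts.insert(0, cuda_lib_dir)
    env["DYLD_LIBRARY_PATH"] = ":".join(parts)
    return env


if __name__ == "__main__":
    lib = os.path.join(CUDA_LIB_DIR, "libcudart.dylib")
    print("libcudart.dylib found:", os.path.exists(lib))
    # To launch the benchmark with the adjusted environment, something like:
    # import subprocess
    # subprocess.run(["./nbody", "-n=61440", "-benchmark"],
    #                env=env_with_cuda_libs(os.environ))
```

If the file check prints False, the runtime library is somewhere else (or missing), and the path needs changing before this can help.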


  • 6 months later...
  • 1 month later...

Performance tab: Device to Device speed shows the VRAM speed.

In your case (GT 520), that value is poor because of the narrow VRAM bus: the GT 520 (like the GT 420, GT 620 and others) uses a 64/128-bit bus instead of 256/384-bit.

Host to Device and Device to Host values (VRAM copies over PCI-E) are mostly limited by PCI-E speed and are far lower than the transfer speed within onboard VRAM.
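
To illustrate why bus width matters so much, here is a small sketch of the usual peak-bandwidth formula: bus width in bytes times the effective memory data rate. The 1800 MT/s rate is an illustrative assumption rather than a measured value for any card in this thread, and the function name is made up:

```python
# Sketch: theoretical peak VRAM bandwidth from bus width and effective data rate.
# Assumption: 1800 MT/s is an example rate, not a measurement from this thread.


def peak_bandwidth_gb_s(bus_width_bits, effective_rate_mt_s):
    """Peak bandwidth in GB/s: bytes per transfer * transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mt_s * 1e6 / 1e9


if __name__ == "__main__":
    for bits in (64, 128, 256, 384):
        print(bits, "bit:", peak_bandwidth_gb_s(bits, 1800), "GB/s")
```

At the same memory data rate, a 256-bit bus has four times the peak bandwidth of a 64-bit bus, which is the gap the Device to Device value reflects.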


  • 2 years later...