CUDA 6

Well, I googled for CUDA 6 and found:

Key features of CUDA 6:
· Unified Memory – Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

· Drop-in Libraries – Automatically accelerates applications’ BLAS and FFTW calculations by up to 8X by simply replacing the existing CPU libraries with the GPU-accelerated equivalents.

· Multi-GPU Scaling – Re-designed BLAS and FFT GPU libraries automatically scale performance across up to eight GPUs in a single node, delivering over nine teraflops of double precision performance per node, and supporting larger workloads than ever before (up to 512GB). Multi-GPU scaling can also be used with the new BLAS drop-in library.

so, … ehmm, don’t listen to me…
i don’t know anything…

Mantle is completely unrelated. Mantle is a lower-level graphics API, below Direct3D or OpenGL. In fact, Mantle is closer to the hardware, whereas CUDA 6 is a step towards more abstraction.

So where are the CUDA gurus?

As far as I understand it, v6 does nothing about the memory limitation of the GPU, so the enthusiastic ones had better sit down again :wink:

It’s unified addressing, and the way I see it, it removes all the nasty code to malloc device memory and copy the data from host to device. You can simply access RAM via unified addressing and the CUDA library does the rest.

No new CUDA version can bypass the “puny” (compared to on-device memory operations) bandwidth of the system bus. So if you want fast rendering, everything that’s needed has to be in VRAM before you start. Every access to host data means syncing the buses, copying the data over the slow system bus, and only then continuing.
And you can’t fill a full cup, so the worst case scenario is that you have to start to flush, buffer and copy to handle the memory in a streaming way, and that would cripple GPU raytracing speed-wise.
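If anyone wants to put a number on that “puny” bus on their own machine, here is a rough sketch (the buffer size and the use of pinned host memory are just my own choices) that times a single host-to-device cudaMemcpy with CUDA events:

```
// Rough host->device bandwidth check: time one big cudaMemcpy over the system bus.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = 256u * 1024 * 1024;               // 256 MB test buffer (arbitrary)
    float *h, *d;
    cudaMallocHost((void **)&h, bytes);              // pinned host memory, so the copy is DMA-friendly
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // the trip over the system bus
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Compare the printed figure with the memory bandwidth quoted for the card’s own VRAM and the gap described above becomes obvious.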

Any insights to confirm or refute that?

@mib2berlin
Thanks for the answer… I was quite ready to read this, in fact… Well, we shall see whether NVIDIA does anything for anything else, then.

Unified Memory – Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

another article about it.

Not to derail the thread, but whatever happened to OpenCL integration? My video card doesn’t support CUDA, and it seems like in one of the early (or test) releases of Cycles I was able to get OpenCL to work, even though it was pretty much the same as using the clay render option.

There are literally dozens of threads on the same exact topic. Please do a forum search.

From what I can see, that is exactly right. To the end user there will be no difference from how things work now; the advantage will be better memory handling for the CUDA developers. The big change will come with the Maxwell architecture, which will require a new card to take advantage of it.

Interesting, yet my thought on how it could work is different… instead of the CPU having to access memory and transfer it to the GPU for processing in bulk, the GPU (driver) would instead access the system memory (via the CPU), perform calculations on it and then pass the results back to system memory (via the CPU)… yes, there is still the overhead of accessing/transferring the data over the system bus, but it would be done “behind the scenes” in the CUDA driver and not in specific code within the application.

Once this is in place and working “in software/driver”, it would allow future hardware changes that could do all the magic without having to specifically hit the CPU. (Actually, I’m sure something like this already exists, as cards can be allocated system memory directly, if I remember correctly; in fact, I seem to recall some GPUs didn’t even have their own memory but were allocated a chunk of system memory.) That could be done in some new and improved way, such as variable amounts of memory instead of fixed chunks, or perhaps a new system bus that could be directly controlled by the GPU, which would allocate/deallocate shared memory without ever having to hit the CPU. Instead, the CPU would see the memory as “marked shared”, with some other type of register(s) to say whether the memory is “exclusive CPU update allowed”, “exclusive GPU update allowed”, “in flux”, “data processed and will not change” and so on, in a single-level memory model (I seem to recall that is the term? a.k.a. IBM’s AS/400 and Multics). The CPU/GPU/other would decide behind the scenes the best place for the live data depending on various factors, but each would see the other’s memory as if it were part of its own…

That’s exactly what I said and how it currently works :smiley:

You have to:

  • allocate host memory
  • declare a pointer for the device memory
  • allocate device memory
  • use cudaMemcpy with those pointers as parameters
  • the data is transferred to the device
  • the kernel manipulates the data in the device memory
  • use cudaMemcpy to copy the manipulated data back to the host
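
As a rough sketch of those steps (the kernel, the sizes and the multiply-by-two are just placeholders I made up, nothing specific):

```
// Explicit (pre-CUDA 6) workflow: allocate on both sides, copy over, run the
// kernel on device memory, copy back. Kernel and sizes are illustrative only.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                            // manipulate data in device memory
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);         // allocate host memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;                                  // pointer for the device memory
    cudaMalloc((void **)&d_data, bytes);            // allocate device memory

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    scale<<<(n + 255) / 256, 256>>>(d_data, n);     // kernel runs on device memory

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```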

There are also more complicated ways where kernel blocks can share memory, but again only on-device.

The memory controller resides in the CPU (since the Core i generation, for Intel as well), and you always have to go over the system bus and through the CPU if you want data from host RAM - and it’s slow.
And those cudaMemcpy calls are the ones slowing everything down. That’s why you want everything you need on the device before you start, then manipulate the data lightning fast, and afterwards only transfer the manipulated data back to the host.

And my guess, without further research, is that, like you say, it’s simplified/unified:

  • allocate host memory (variable/array…)
  • call the kernel function with the array
    Done.
    The rest from above is handled by CUDA 6, relieving the programmer of allocating and freeing memory, creating pointers and copying memory blocks around.
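
My untested guess at what that could look like with cudaMallocManaged from CUDA 6 (kernel and sizes again purely illustrative):

```
// Same work as the explicit version above, but with one managed allocation that
// both host and device access through the same pointer; the driver does the copies.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));    // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // host writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n);       // kernel uses the same pointer
    cudaDeviceSynchronize();                        // wait before the host reads it again

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The bus traffic doesn’t go away; it’s just handled behind the scenes by the driver instead of by explicit cudaMemcpy calls in the application.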

NVIDIA released some more information on Unified Memory in CUDA6. It will be supported on Kepler (6xx series).

Also, the OpenCL 2.0 spec has been officially released, bringing its feature set closer to parity with CUDA 6. Whether its features will actually get implemented for any current-gen hardware remains to be seen, however.

Thanks, that’s highly interesting and pretty much confirms my guesstimation.

While I can hear Brecht’s sigh of relief, especially with linked lists and deep copies, I doubt it’ll find its way into Blender anytime soon given the “limitation” to Kepler+.

Cycles is already designed with the GPU limitations in mind, so there’s little sharing of data structures. However, it certainly would be worth having a CUDA version that allocates data in unified memory; that would be a somewhat small effort. Blender already ships kernels for different GPU architectures anyway.

Slightly off topic, but I’d really like some help. I’ve spent forever searching… but no luck. If I try to install a newer version of the CUDA Toolkit (I have 3.0, which works with Cycles; I’m running an overclocked GTX 660), it says it requires Microsoft Visual Studio (not free by any means)… what’s up? It sounds like the newer versions are an improvement… yes? Necessary for best performance? Any answers would be a great help. Thanks

You can get Visual Studio Express; it’s free and it works just fine.

Thank you… CUDA 5.5 needs VS2012…not 2013 which I tried first…

Hi mib, is CUDA 6 compiled into 2.70?

We are still on CUDA 5.0, and will also use that for Blender 2.70.

Are there any unofficial / GraphicAll builds with CUDA 6? I found none yet.
Is there any time horizon for starting to use CUDA 6?