CUDA 6

Well, I googled for CUDA 6 and found:

Key features of CUDA 6:
· Unified Memory – Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

· Drop-in Libraries – Automatically accelerates applications’ BLAS and FFTW calculations by up to 8X by simply replacing the existing CPU libraries with the GPU-accelerated equivalents.

· Multi-GPU Scaling – Re-designed BLAS and FFT GPU libraries automatically scale performance across up to eight GPUs in a single node, delivering over nine teraflops of double precision performance per node, and supporting larger workloads than ever before (up to 512GB). Multi-GPU scaling can also be used with the new BLAS drop-in library.

so, … ehmm, don’t listen to me…
i don’t know anything…

Mantle is completely unrelated. Mantle is a lower-level graphics API, below Direct3D or OpenGL. In fact, Mantle is closer to the hardware, whereas CUDA 6 is a step towards more abstraction.

So where are the CUDA gurus?

As far as I understand it, v6 does nothing about the memory limitation of the GPU, so the enthusiastic ones had better sit down again :wink:

It’s unified addressing, and the way I see it, it removes all the nasty code to malloc device memory and copy the data from host to device. You can simply access RAM via unified addressing and the CUDA library does the rest.

No new CUDA version can bypass the “puny” (compared to on-device memory operations) bandwidth of the system bus. So if you want fast rendering, everything that’s needed has to be in VRAM before you start. Every access to host data means syncing the buses, copying the data over the slow system bus, and only then continuing.
And you can’t fill a full cup, so the worst case scenario is that you have to start to flush, buffer and copy to handle the memory in a streaming way, and that would cripple GPU raytracing speed-wise.
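If anyone wants to put a number on that “puny” bus on their own machine, here is a rough sketch (the buffer size and the use of pinned host memory are just my own choices) that times a single host-to-device cudaMemcpy with CUDA events:

```
// Rough host->device bandwidth check: time one big cudaMemcpy over the system bus.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = 256u * 1024 * 1024;               // 256 MB test buffer (arbitrary)
    float *h, *d;
    cudaMallocHost((void **)&h, bytes);              // pinned host memory, so the copy is DMA-friendly
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // the trip over the system bus
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Compare the printed figure with the memory bandwidth quoted for the card’s own VRAM and the gap described above becomes obvious.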

Any insights to confirm or refute that?

@mib2berlin
Thanks for the answer… I was quite ready to read this, in fact… Well, we shall see whether NVIDIA does anything for anything else, then.

Unified Memory – Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

another article about it.

Not to derail the thread, but whatever happened to OpenCL integration? My video card doesn’t support CUDA, and it seems like in one of the early (or test) releases of Cycles I was able to get OpenCL to work, even though it was pretty much the same as using the clay render option.

There are literally dozens of threads on the same exact topic. Please do a forum search.

From what I can see, that is exactly right. To the end user there will be no difference from how things work now; the advantage will be better memory handling for the CUDA developers. The big change will come with the Maxwell architecture, which will require a new card to take advantage of it.

Interesting, yet my thought on how it could work is different… instead of the CPU having to access memory and transfer it to the GPU for processing in bulk, the GPU (driver) would instead access the system memory (via the CPU), perform calculations on it and then pass the results back to system memory (via the CPU)… yes, there is still the overhead of accessing/transferring the data over the system bus, but it would be done “behind the scenes” in the CUDA driver and not in specific code within the application.

Once this is in place and working “in software/driver”, it would allow future hardware changes that could do all the magic without having to specifically hit the CPU. (Actually, I’m sure something like this already exists, as cards can be allocated system memory directly, if I remember correctly; in fact, I seem to recall some GPUs didn’t even have their own memory but were allocated a chunk of system memory.) That could be done in some new and improved way, such as variable amounts of memory instead of fixed chunks, or perhaps a new system bus that could be directly controlled by the GPU, which would allocate/deallocate shared memory without ever having to hit the CPU. Instead, the CPU would see the memory as “marked shared”, with some other type of register(s) to say whether the memory is “exclusive CPU update allowed”, “exclusive GPU update allowed”, “in flux”, “data processed and will not change” and so on, in a single-level memory model (I seem to recall that is the term? a.k.a. IBM’s AS/400 and Multics). The CPU/GPU/other would decide behind the scenes the best place for the live data depending on various factors, but each would see the other’s memory as if it were part of its own…

That’s exactly what I said and how it currently works :smiley:

You have to:

  • allocate host memory
  • declare a pointer for the device memory
  • allocate device memory
  • use cudaMemcpy with those pointers as parameters
  • the data is transferred to the device
  • the kernel manipulates the data in the device memory
  • use cudaMemcpy to copy the manipulated data back to the host
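
As a rough sketch of those steps (the kernel, the sizes and the multiply-by-two are just placeholders I made up, nothing specific):

```
// Explicit (pre-CUDA 6) workflow: allocate on both sides, copy over, run the
// kernel on device memory, copy back. Kernel and sizes are illustrative only.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                            // manipulate data in device memory
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);         // allocate host memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;                                  // pointer for the device memory
    cudaMalloc((void **)&d_data, bytes);            // allocate device memory

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device

    scale<<<(n + 255) / 256, 256>>>(d_data, n);     // kernel runs on device memory

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```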

There are also more complicated ways where kernel blocks can share memory, but again only on-device.

The memory controller resides in the CPU (since the Core i generation, for Intel as well), and you always have to go over the system bus and through the CPU if you want data from host RAM - and it’s slow.
And those cudaMemcpy calls are the ones slowing everything down. That’s why you want everything you need on the device before you start, then manipulate the data lightning fast, and afterwards only transfer the manipulated data back to the host.

And my guess, without further research, is that, like you say, it’s simplified/unified:

  • allocate host memory (variable/array…)
  • call the kernel function with the array
    Done.
    The rest from above is handled by CUDA 6, relieving the programmer of allocating and freeing memory, creating pointers and copying memory blocks around.
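
My untested guess at what that could look like with cudaMallocManaged from CUDA 6 (kernel and sizes again purely illustrative):

```
// Same work as the explicit version above, but with one managed allocation that
// both host and device access through the same pointer; the driver does the copies.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));    // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // host writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n);       // kernel uses the same pointer
    cudaDeviceSynchronize();                        // wait before the host reads it again

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

The bus traffic doesn’t go away; it’s just handled behind the scenes by the driver instead of by explicit cudaMemcpy calls in the application.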

NVIDIA released some more information on Unified Memory in CUDA6. It will be supported on Kepler (6xx series).

Also, the OpenCL 2.0 spec has been officially released, bringing its feature set closer to parity with CUDA 6. Whether its features will actually get implemented for any current-gen hardware remains to be seen, however.

Thanks, that’s highly interesting and pretty much confirms my guesstimation.

While I can hear Brecht’s sigh of relief, especially with linked lists and deep copies, I doubt it’ll find its way into Blender anytime soon given the “limitation” to Kepler+.

Cycles is already designed with the GPU limitations in mind, so there’s little sharing of data structures. However, it certainly would be worth having a CUDA version that allocates data in unified memory; that would be a somewhat small effort. Blender already ships kernels for different GPU architectures anyway.

Slightly off topic, but I’d really like some help. I’ve spent forever searching… but no luck. If I try to install a newer version of the CUDA Toolkit (I have 3.0, which works with Cycles; I’m running an overclocked GTX 660), it says it requires Microsoft Visual Studio (not free by any means)… what’s up? It sounds like the newer versions are an improvement… yes? Necessary for best performance? Any answers would be a great help. Thanks

You can get Visual Studio Express; it’s free and it works just fine.

Thank you… CUDA 5.5 needs VS2012…not 2013 which I tried first…

Hi mib, is CUDA 6 compiled into 2.70?

We are still on CUDA 5.0, and will also use that for Blender 2.70.

Are there any unofficial / GraphicAll builds with CUDA 6? I found none yet.
Is there any time horizon for starting to use CUDA 6?