Setting up a multi-GPU cluster for Cycles rendering

Not sure if this is the right forum.

Currently I use a computer with two CUDA cards for rendering. It’s plenty fast for rendering still images, but it still takes ages to render animations. I’ve used online render farms in the past, but the fees add up over time.

So, I’m interested in building a multi-GPU cluster for rendering. Anyone know how I would do this? Could I somehow attach a cluster of, say, 10 CUDA GPUs to my main computer and access them all via Blender/CUDA, just as I now use two GPUs?

Hi, this should be possible with Lokirender or even with the included netrender addon.
I have read on the Octane forum about an 8 GPU limit in Windows 7/8; I don’t know about Windows 10.
It may be better to build two 4-5 GPU systems, because 3000 watt power supplies are hard to find/expensive, or you need a special case that fits two supplies.
Budget-wise I would go for the GTX 970 4GB, or if you have a lot of money, the GTX 980 Ti 6GB. :slight_smile:

Cheers, mib


You would need/want/have to use riser cables to extend the PCIe slots to fit that many GPUs on it.

There is also the option of Y riser cards, which split one slot into several, at the expense of the speed of transferring data to the cards (which usually accounts for only a marginal amount of time…)
http://www.ameri-rack.com/ARC2-PELY423-C7_m.html

You can also read a lot on the Octane forums about such builds.
By the way, Octane has/had a 12 GPU limit…

Cycles should be even better.

It’s not really cheap to put many GPUs on a single motherboard. Yes, the power supply needs to be heavy-duty. The motherboard also needs to have the slots and space for them. But you also need a processor that has enough PCIe lanes to handle the PCIe data traffic.

Normally Intel processors have 16 lanes. If you have 2 GPUs, then you have 8 lanes for each and that’s fine. But when you add a third, the GPUs drop to x4 mode and may not render as fast as you wish, because of the PCIe data bottleneck.

For example, the Core i7-5820K (http://ark.intel.com/products/82932/Intel-Core-i7-5820K-Processor-15M-Cache-up-to-3_60-GHz) has 28 lanes. You can set 3 GPUs to x8 mode.

Hi, PCIe performance does not influence render speed, only loading time.
Render engines work fine at x4 speed, but x1 seems too slow for big scenes, at only 500 MB/s in theory.
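As a rough illustration of what that means for the one-time scene upload (assuming PCIe 2.0 theoretical per-direction bandwidths and a made-up 2 GB scene; real-world throughput is lower):

```python
# Rough scene-upload times at PCIe 2.0 theoretical bandwidths (illustrative only).
PCIE2_BANDWIDTH_MB_S = {"x1": 500, "x4": 2000, "x8": 4000, "x16": 8000}

scene_size_mb = 2000  # hypothetical 2 GB scene

for width, mb_s in PCIE2_BANDWIDTH_MB_S.items():
    print(f"{width}: ~{scene_size_mb / mb_s:.2f} s to upload the scene")
```

Since that upload only happens when the scene is sent to the card, the extra second or two at x4 disappears into the render time, while x1 with a multi-gigabyte scene starts to be noticeable.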
@vilvei, good to know; that CPU is fine for 6 GPUs at x4.

Cheers, mib

Also note that there’s a point of diminishing returns. One of the machines I render on is a beast with 4 CUDA devices. I have a scene that renders a frame on a single GPU in roughly 45 seconds. However, if I render that frame across all 4 GPUs, the render time bumps up to nearly a full minute. This is because once each render tile is finished on the GPU, it must still be sent back to the CPU and stitched together into a single image. That transfer and stitching time turns out to be pretty expensive, especially on frames with short render times. And I’d imagine that processing cost grows with each tile that needs to be added to the stitch.
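As a back-of-envelope check on those numbers (the 45-second and roughly one-minute figures are the ones quoted above; attributing the entire gap to tile copy-back, stitching, and scheduling is only an estimate):

```python
# Back-of-envelope split of the 4-GPU frame time quoted above (illustrative).
single_gpu_s = 45.0  # one GPU renders the frame in roughly 45 seconds
four_gpu_s = 60.0    # four GPUs take nearly a full minute

ideal_parallel_s = single_gpu_s / 4          # ~11 s if the rendering itself scaled perfectly
overhead_s = four_gpu_s - ideal_parallel_s   # ~49 s that doesn't scale: copy-back, stitching, sync

print(f"ideal parallel part: ~{ideal_parallel_s:.0f} s")
print(f"non-scaling part:    ~{overhead_s:.0f} s")
```

On a frame that takes several minutes to render, that same fixed cost would be a much smaller fraction of the total.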

Of course, there is a workaround. In my case, I have a batch script that launches four simultaneous instances of Blender, each working on the same file, but with only one GPU assigned to it. Granted, if each frame takes more than a couple of minutes to render, then the stitching cost is less of a worry.
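A minimal sketch of that approach (not the actual script) might look like this. It assumes a Blender 2.8+ style Cycles preferences API (`bpy.context.preferences.addons['cycles'].preferences`), and the Blender path, .blend file, GPU count, and frame range are all placeholders:

```python
#!/usr/bin/env python3
"""Launch one background Blender instance per CUDA device, each rendering its
own chunk of the animation on a single GPU. A sketch only; adjust the paths,
GPU count, and frame range to your setup."""
import subprocess

BLENDER = "blender"              # path to the Blender binary
BLEND_FILE = "my_scene.blend"    # placeholder .blend file
NUM_GPUS = 4
FRAME_START, FRAME_END = 1, 240  # placeholder frame range

# Expression each instance runs before rendering: enable exactly one CUDA
# device and make sure Cycles renders on the GPU (Blender 2.8+ style API).
DEVICE_EXPR = (
    "import bpy; "
    "p = bpy.context.preferences.addons['cycles'].preferences; "
    "p.compute_device_type = 'CUDA'; p.get_devices(); "
    "[setattr(d, 'use', False) for d in p.devices]; "
    "cuda = [d for d in p.devices if d.type == 'CUDA']; "
    "cuda[{gpu}].use = True; "
    "bpy.context.scene.cycles.device = 'GPU'"
)

chunk = (FRAME_END - FRAME_START + 1) // NUM_GPUS
procs = []
for gpu in range(NUM_GPUS):
    start = FRAME_START + gpu * chunk
    end = FRAME_END if gpu == NUM_GPUS - 1 else start + chunk - 1
    cmd = [
        BLENDER, "-b", BLEND_FILE,
        "--python-expr", DEVICE_EXPR.format(gpu=gpu),
        "-s", str(start), "-e", str(end), "-a",
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```

This splits the frame range into contiguous chunks per instance; the other common approach is to point all instances at the full range and rely on “Placeholders” with “Overwrite” disabled in the output settings, so each instance skips frames another one has already claimed.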

You can find used Tesla S1070 servers on eBay (~150 USD). They can hold 4 full-size M2090 (Fermi) cards (~250 USD each), or if you are handy you can fit some GeForce cards. The Tesla servers must be attached to the host with an HIC (host interface card); look for the two-connection card (~100 USD). If you have 3 PCIe x8/x16 slots available, you can attach 3 servers, which means 12 Tesla cards. These systems are noisy as hell, so buy some 2 m cables (~50 USD) and put the servers in another room.

The same thing can be done with Quadro Plex units; just search on eBay.

My rant about PCIe lanes was based on my own experience with particle hair. I have two (identical) GTX 750 Ti cards. When I bought the second one, my old motherboard did not split the lanes evenly, but set the first card to x16 and the second to x4. Everything seemed fine until I rendered a scene with particle hair. The first card ran like before, but the second was about 10 times slower. The first card finished nearly all the tiles and then waited for the second one. It didn’t matter whether I rendered with both GPUs or just one; the second was always slower, and the only explanation I could think of was the combination of particle hair and x4 mode. I bought a better motherboard that promised to set 3 GPUs to x16 mode where possible; now both cards run at x8, and the same particle hair renders at the same speed on both cards, and at the same speed as before at x16.

Or perhaps that mobo was just a really crappy one, and the modes and PCIe lanes had nothing to do with it? In any case, beware of some random mobos.

Do you have different cards?

If you use the exact same card, it should be an almost perfect division, not counting BVH build time: 1 card = 1 min, 2 cards = ~30 seconds, 4 cards = ~15 seconds.

That has been my experience, but I have never used more than 2 cards… Before I got these two 970s I used a 560 Ti and a 460, and sometimes the 460 would sit there working long after the 560 Ti was done, even though they are very close in Cycles performance.

Same cards. This is an issue specific to HD renders that take less than a minute per frame. For renders that short, the reassembly time for tiles is non-trivial and the four-instances solution is the better way to go.

Fweeb, please share this script; it is still relevant nowadays.

I have it up as a Gist on GitHub. I really should make this a blog post on my website, though.
